EDA
Distribution of
Individual Features
The distributions of the individual features in the NHANES dataset
provide a snapshot of the characteristics of the population in terms of
various health and demographic variables. Understanding these
distributions is essential for identifying patterns, detecting potential
biases, and assessing the range and variability of each feature. This
analysis helps in understanding how variables such as age, body weight,
blood pressure, and lifestyle factors are spread across the dataset,
which is crucial for subsequent statistical modeling and
interpretation.
# Convert variables to factors with proper labels
nhanes$sex <- factor(nhanes$sex, levels = c(0, 1), labels = c("Female", "Male"))
nhanes$race <- factor(nhanes$race, levels = c(1, 2, 3), labels = c("White", "Black", "Other"))
nhanes$past.smk <- factor(nhanes$past.smk, levels = c(1, 2), labels = c("Yes", "No"))
nhanes$current.smk <- factor(nhanes$current.smk, levels = c(1, 2), labels = c("Yes", "No"))
# Create frequency tables
sex_count <- table(nhanes$sex)
race_count <- table(nhanes$race)
past.smk_count <- table(nhanes$past.smk)
current.smk_count <- table(nhanes$current.smk)
par(mfrow = c(2, 2))
# Plot the bar charts
barplot(height = sex_count, col = "steelblue", main = "Sex", xlab = "Sex", ylab = "Count")
barplot(height = race_count, col = "steelblue", main = "Race", xlab = "Race", ylab = "Count")
barplot(height = past.smk_count, col = "steelblue", main = "Past Smoker", xlab = "Past Smoker", ylab = "Count")
barplot(height = current.smk_count, col = "steelblue", main = "Current Smoker", xlab = "Current Smoker", ylab = "Count")

There are slightly more male observations than female observation in
this data set. The majority of participants are white, followed by
black. All participants are recorded as being past smokers, which is
likely an error. There is an imbalance for past smoker. In turn, past
smoker cannot be used for analysis. About half of current participants
are current smokers.
par(mfrow = c(2, 2))
# Histogram for Cholesterol
hist(nhanes$cholestrol,
main = "Histogram of Cholesterol",
xlab = "Cholesterol",
col = "steelblue",
border = "black",
breaks = 20) # Control number of bins
# Histogram for Systolic BP
hist(nhanes$avg.sbp,
main = "Histogram of Systolic BP",
xlab = "Systolic BP",
col = "steelblue",
border = "black",
breaks = 20)
# Histogram for Body Weight
hist(nhanes$body.weight,
main = "Histogram of Body Weight",
xlab = "Body Weight",
col = "steelblue",
border = "black",
breaks = 20)
# Histogram for Height
hist(nhanes$body.weight,
main = "Histogram of Height",
xlab = "Height",
col = "steelblue",
border = "black",
breaks = 20)

Most participants do not have high blood pressure. The age of
participants varies from about 20 to about 90. The height distribution
is roughly symmetric. The body weight distribution is skewed right due
to some high outliers. The high outliers may indicate obesity.
par(mfrow = c(2,2))
hist(nhanes$avg.sbp, col = "steelblue",
main = "Systolic BP",
xlab = "Systolic BP", ylab = "Frequency",
breaks = 30)
hist(nhanes$avg.dbp, col = "steelblue",
main = "Diastolic BP",
xlab = "Diastolic BP", ylab = "Frequency",
breaks = 30)
hist(nhanes$cholestrol, col = "steelblue",
main = "Cholesterol",
xlab = "Cholesterol", ylab = "Frequency",
breaks = 30)

Systolic BP and cholesterol have skewed right distributions.
Diastolic BP is roughly symmetric. Of the 7927 participants, 1964 are
obese.
Relationship between
Features
Examining relationships between variables using scatterplots and a
correlation matrix provides a comprehensive approach to understanding
associations. Scatterplots visually reveal how two continuous variables
interact, allowing us to identify trends, correlations, and potential
outliers. Meanwhile, a correlation matrix quantifies the strength and
direction of relationships between multiple variables at once, helping
to pinpoint which pairs are strongly correlated. Together, scatterplots
and a correlation matrix offer valuable insights into the underlying
patterns in the data, guiding further statistical analysis and model
selection.
# Set seed for reproducibility
set.seed(123)
# Sample 500 rows (adjust the number as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]
# Create a pairwise scatterplot matrix with the sampled data
pairs(sampled_data,
main = "Pairwise Scatterplots (Bootstrapped Sample)",
pch = 19, # Shape of the points (circle)
col = rgb(0, 0, 1, 0.5) # Color with transparency (blue)
)

nhanes_subset <- nhanes[, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]
# Compute the correlation matrix
cor_matrix <- cor(nhanes_subset, use = "pairwise.complete.obs")
# Print the correlation matrix
print(cor_matrix)
age body.weight height avg.sbp avg.dbp cholestrol
age 1.00000000 -0.01377966 -0.08538971 0.55641362 0.1143690 0.2738272
body.weight -0.01377966 1.00000000 0.45314709 0.13693751 0.2900425 0.1032280
height -0.08538971 0.45314709 1.00000000 0.01379185 0.2060106 -0.0631896
avg.sbp 0.55641362 0.13693751 0.01379185 1.00000000 0.5213667 0.2306877
avg.dbp 0.11436905 0.29004249 0.20601064 0.52136671 1.0000000 0.1706567
cholestrol 0.27382718 0.10322800 -0.06318960 0.23068766 0.1706567 1.0000000
corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.8)

The most highly correlate features are age and average systolic blood
pressure, body weight and height, average systolic blood pressure and
average diastolic blood pressure, age and cholesterol, body weight and
cholesterol, and body weigh and average diastolic blood pressure. The
pairwise relationships are displayed with individual scatter plots.
# Set seed for reproducibility
set.seed(123)
# Bootstrap a smaller sample (adjust size as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, ]
# Set up a 2x3 plotting grid
par(mfrow = c(2, 3))
# Scatterplot for Age and Avg. Systolic Blood Pressure
plot(sampled_data$age, sampled_data$avg.sbp, main = "Age vs Avg. Systolic BP",
xlab = "Age", ylab = "Avg. Systolic BP", col = "blue", pch = 19)
# Scatterplot for Body Weight and Height
plot(sampled_data$body.weight, sampled_data$height, main = "Body Weight vs Height",
xlab = "Body Weight", ylab = "Height", col = "red", pch = 19)
# Scatterplot for Avg. Systolic Blood Pressure and Avg. Diastolic Blood Pressure
plot(sampled_data$avg.sbp, sampled_data$avg.dbp, main = "Avg. Systolic vs Avg. Diastolic BP",
xlab = "Avg. Systolic BP", ylab = "Avg. Diastolic BP", col = "forestgreen", pch = 19)
# Scatterplot for Age and Cholesterol
plot(sampled_data$age, sampled_data$cholestrol, main = "Age vs Cholesterol",
xlab = "Age", ylab = "Cholesterol", col = "purple", pch = 19)
# Scatterplot for Avg. Systolic BP and Cholesterol
plot(sampled_data$avg.sbp, sampled_data$cholestrol, main = "Avg Systolic BP vs Cholesterol",
xlab = "Cholesterol", ylab = "Avg.SBP", col = "gold", pch = 19)
# Scatterplot for Body Weight and Avg. Diastolic Blood Pressure
plot(sampled_data$body.weight, sampled_data$avg.dbp, main = "Body Weight vs Avg. Diastolic BP",
xlab = "Body Weight", ylab = "Avg. Diastolic BP", col = "maroon", pch = 19)

Six of the most correlated pairwise relationship are visualized in
scatter plots. All pairwise relationships are linear with no apparent
clustering.
# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))
# Create a contingency table
sex_hbp_table <- table(nhanes$sex, nhanes$hbp)
# Generate mosaic plot
mosaicplot(sex_hbp_table,
col = c("gold", "royalblue"),
main = "Sex and High Blood Pressure",
xlab = "Sex Status",
ylab = "HBP Status",
border = "black")
# Create a contingency table
race_hbp_table <- table(nhanes$race, nhanes$hbp)
# Generate mosaic plot
mosaicplot(race_hbp_table,
col = c("gold", "royalblue"),
main = "Race and High Blood Pressure",
xlab = "Race",
ylab = "HBP Status",
border = "black")
# Create a contingency table
current.smk_hbp_table <- table(nhanes$current.smk, nhanes$hbp)
# Generate mosaic plot
mosaicplot(current.smk_hbp_table,
col = c("gold", "royalblue"),
main = "Current Smoking Status and HBP",
xlab = "Current Smoking Status",
ylab = "HBP Status",
border = "black")
# Create a contingency table
past.smk_hbp_table <- table(nhanes$past.smk, nhanes$hbp)
# Generate mosaic plot
mosaicplot(past.smk_hbp_table,
col = c("gold", "royalblue"),
main = "Past Smoking Status and HBP",
xlab = "Past Smoking Status",
ylab = "HBP Status",
border = "black")

Relationship between high blood pressure and race, sex, and current
smoking status can be visualized in the mosaic plots. There appears to
be an association between high current smoking and race with high blood
pressure. A less pronounced relationship can been seen between sex and
race with high blood pressure.
# Set up a 2x2 plotting grid
par(mfrow = c(1, 2))
# Create a contingency table
current.smk_race_table <- table(nhanes$race, nhanes$current.smk)
# Generate mosaic plot
mosaicplot(current.smk_race_table,
col = c("gold", "royalblue"),
main = "Race and Current Smoke Status",
xlab = "Race",
ylab = "Current Smoke",
border = "black")
# Create a contingency table
current.smk_sex_table <- table(nhanes$sex, nhanes$current.smk)
# Generate mosaic plot
mosaicplot(current.smk_sex_table,
col = c("gold", "royalblue"),
main = "Sex and Current Smoke Status",
xlab = "Sex",
ylab = "Current Smoke",
border = "black")

# Create a contingency table
past.smk_race_table <- table(nhanes$race, nhanes$past.smk)
An association exists between race and current smoking. Black
individuals tend to smoke more than white or other individuals. An
association exists between sex and current smoke status. Female tend to
smoke currently slight more than males in this population.
MISSING VALUES
Imputation for
Categorical Features
The distribution of missing values for is examined to determine any
patterns. This step is done to ensure that value are missing at
random.
# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))
#sum(is.na(nhanes$race))
#mean(is.na(nhanes$race))
#table(nhanes$race, useNA = "ifany")
#ggplot(nhanes, aes(x = factor(race))) +
#geom_bar(fill = "blue") +
#labs(title = "Distribution of Race (Before Imputation)", x = "Race", y = "Count") +
#theme_minimal()
aggr(nhanes, col = c("navyblue", "red"), numbers = TRUE, sortVars = TRUE,
labels = names(nhanes), cex.axis = 0.7, gap = 2,
ylab = c("Missing Data Overview", "Pattern of Missing Data"))

Variables sorted by number of missings:
Variable Count
race 0.1499937
cholestrol 0.1499937
body.weight 0.1400278
X 0.0000000
obs 0.0000000
psu 0.0000000
stratum 0.0000000
stat.weight 0.0000000
age 0.0000000
sex 0.0000000
height 0.0000000
avg.sbp 0.0000000
avg.dbp 0.0000000
past.smk 0.0000000
current.smk 0.0000000
hbp 0.0000000
A pattern of missing values exists between body weight, BMI, and
obese. This pattern can be explained by the fact that body weight is
used to calculate BMI, which is then used to classify observations for
obese. However, the pattern of missing value appears to be random for
the other features.
No pattern of missing values exists between race and other features,
In turn, kNN imputation is used for missing values of race.
sum(is.na(nhanes$race))
[1] 1189
table(nhanes$race, useNA = "ifany") # Check distribution including NAs
White Black Other <NA>
4693 1858 187 1189
# Perform KNN Imputation (single imputation)
nhanes_imputed <- kNN(nhanes, variable = "race", k = 3)
# Keep only the original columns (without extra columns added by kNN)
nhanes_imputed <- nhanes_imputed[, names(nhanes)]
# COMPARE before and after imputation distribution
par(mfrow = c(1,2)) # Split plotting area into two
# Convert race to a factor for better visualization
nhanes$race <- as.factor(nhanes$race)
nhanes_imputed$race <- as.factor(nhanes_imputed$race)
# Create a new column to indicate whether the data is original or imputed
nhanes$source <- "Before Imputation"
nhanes_imputed$source <- "After Imputation"
# Combine both datasets
nhanes_compare <- bind_rows(nhanes, nhanes_imputed)
# Plot the distribution before and after imputation
ggplot(nhanes_compare, aes(x = race, fill = source)) +
geom_bar(position = "dodge") +
labs(title = "Distribution of Race Before and After Imputation",
x = "Race",
y = "Count",
fill = "Dataset") +
theme_minimal()

imputed_nhanes <- nhanes_imputed
The overall distribution of race after imputation changed slightly
after imputation.
Regression-Based
Imputation for Numerical Feature
The pattern of missing values for body weight and cholesterol is
examined for any potential patterns.
#vis_miss(nhanes[, c("body.weight", "cholestrol")])
# Set smaller font size for the plot
par(cex = 0.4) # Adjust this value for desired font size
# Generate the missing data pattern plot
md.pattern(imputed_nhanes)

X obs psu stratum stat.weight age sex race height avg.sbp avg.dbp past.smk
5792 1 1 1 1 1 1 1 1 1 1 1 1
1025 1 1 1 1 1 1 1 1 1 1 1 1
946 1 1 1 1 1 1 1 1 1 1 1 1
164 1 1 1 1 1 1 1 1 1 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0
current.smk hbp source body.weight cholestrol
5792 1 1 1 1 1 0
1025 1 1 1 1 0 1
946 1 1 1 0 1 1
164 1 1 1 0 0 2
0 0 0 1110 1189 2299
# Reset graphics settings (optional)
par(cex = 1)
A pattern of missing values exists between body weight, BMI, and
obese because
Body weight is correlated with height, average diastolic blood
pressure, and sex. In turn, the missing body weight values will be
imputed using a regression model using said explanatory features.
# Subset nhanes to contain only complete observations for selected variables
complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("body.weight", "height", "avg.dbp", "sex")]), ]
# Fit the linear regression model
model_body.weight <- lm(body.weight ~ height + avg.dbp + sex, data = complete_nhanes)
# View the model summary
summary(model_body.weight)
Call:
lm(formula = body.weight ~ height + avg.dbp + sex, data = complete_nhanes)
Residuals:
Min 1Q Median 3Q Max
-86.131 -22.276 -4.259 17.073 269.863
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -207.90214 9.27828 -22.407 < 2e-16 ***
height 4.86582 0.14277 34.080 < 2e-16 ***
avg.dbp 0.76728 0.03913 19.606 < 2e-16 ***
sexMale -7.00000 1.07917 -6.486 9.4e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 33.52 on 6813 degrees of freedom
Multiple R-squared: 0.2495, Adjusted R-squared: 0.2491
F-statistic: 754.9 on 3 and 6813 DF, p-value: < 2.2e-16
# Plot the residuals
plot(model_body.weight$residuals)

The residual plot shows no patterns. In turn, the model fit the data
well and the assumption of constant variance of residuals is
satisfied.
# Create a copy of the dataset to work with
#imputed_nhanes <- complete_nhanes
colSums(is.na(imputed_nhanes))
X obs psu stratum stat.weight age
0 0 0 0 0 0
sex race body.weight height avg.sbp avg.dbp
0 0 1110 0 0 0
past.smk current.smk cholestrol hbp source
0 0 1189 0 0
# Identify the rows with missing values for body.weight
missing_body_weight <- is.na(imputed_nhanes$body.weight)
# Use the model to predict the missing values
imputed_nhanes$body.weight[missing_body_weight] <- predict(model_body.weight,
newdata = imputed_nhanes[missing_body_weight, c("height", "avg.dbp", "sex", "age")])
# Check the dataset
#head(imputed_nhanes)
# Create a plot to overlay the density curves
ggplot() +
geom_density(data = complete_nhanes, aes(x = body.weight), fill = "blue", alpha = 0.2, na.rm = TRUE) +
geom_density(data = imputed_nhanes, aes(x = body.weight), fill = "red", alpha = 0.2, na.rm = TRUE) +
labs(title = "Density Plot of Body Weight: Before and After Imputation",
x = "Body Weight", y = "Density") +
theme_minimal() +
scale_fill_manual(name = "Body Weight", values = c("blue" = "blue", "red" = "red"),
labels = c("Before Imputation", "After Imputation"))

The distribution before and after imputation is similar. In turn, the
imputation did not change the distribution of body weight.
The features that are most correlated to cholesterol are age, average
systolic blood pressure, and body weight. In turn, the missing values of
cholesterol will be imputed with a model using these features.
# Subset nhanes to include only rows with complete cases for cholestrol, age, avg.sbp, and BMI
complete_nhanes <- imputed_nhanes[!is.na(imputed_nhanes$cholestrol) & !is.na(imputed_nhanes$age) &
!is.na(imputed_nhanes$avg.sbp) & !is.na(imputed_nhanes$body.weight), ]
complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("cholestrol", "age", "avg.sbp", "body.weight")]), ]
# Fit the linear regression model to predict cholestrol
model_cholestrol <- lm(cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)
summary(model_cholestrol)
Call:
lm(formula = cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)
Residuals:
Min 1Q Median 3Q Max
-162.16 -28.18 -3.08 24.08 498.54
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.10667 3.96327 34.090 < 2e-16 ***
age 0.54613 0.03465 15.762 < 2e-16 ***
avg.sbp 0.21908 0.03267 6.707 2.15e-11 ***
body.weight 0.10341 0.01437 7.198 6.76e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 42.59 on 6734 degrees of freedom
Multiple R-squared: 0.09077, Adjusted R-squared: 0.09037
F-statistic: 224.1 on 3 and 6734 DF, p-value: < 2.2e-16
# View the model summary
summary(model_cholestrol)
Call:
lm(formula = cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)
Residuals:
Min 1Q Median 3Q Max
-162.16 -28.18 -3.08 24.08 498.54
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.10667 3.96327 34.090 < 2e-16 ***
age 0.54613 0.03465 15.762 < 2e-16 ***
avg.sbp 0.21908 0.03267 6.707 2.15e-11 ***
body.weight 0.10341 0.01437 7.198 6.76e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 42.59 on 6734 degrees of freedom
Multiple R-squared: 0.09077, Adjusted R-squared: 0.09037
F-statistic: 224.1 on 3 and 6734 DF, p-value: < 2.2e-16
# Create a copy of nhanes for imputation
#imputed_nhanes <- nhanes
# Identify rows with missing cholestrol values
missing_cholestrol <- is.na(imputed_nhanes$cholestrol)
# Use the model to predict the missing cholestrol values
imputed_nhanes$cholestrol[missing_cholestrol] <- predict(model_cholestrol,
newdata = imputed_nhanes[missing_cholestrol, c("age", "avg.sbp", "body.weight")])
All explanatory features are significant.
# Get residuals from the model for complete_nhanes
residuals_cholestrol <- residuals(model_cholestrol)
# Predicted values for complete_nhanes
predicted_cholestrol <- predict(model_cholestrol, newdata = complete_nhanes)
# Create a data frame for residuals and predicted values
residuals_data <- data.frame(
predicted_cholestrol = predicted_cholestrol,
residuals_cholestrol = residuals_cholestrol
)
# Create the residual plot
library(ggplot2)
ggplot(residuals_data, aes(x = predicted_cholestrol, y = residuals_cholestrol)) +
geom_point(color = "blue") + # Plot the residuals
geom_hline(yintercept = 0, linetype = "dashed", color = "red") + # Add a horizontal line at 0
labs(title = "Residual Plot for Cholestrol Prediction",
x = "Predicted Cholestrol",
y = "Residuals") +
theme_minimal()

# Create the overlay density plot
ggplot() +
geom_density(data = nhanes, aes(x = cholestrol), fill = "blue", alpha = 0.2, na.rm = TRUE) + # Before imputation
geom_density(data = imputed_nhanes, aes(x = cholestrol), fill = "red", alpha = 0.2, na.rm = TRUE) + # After imputation
labs(title = "Density Plot of Cholestrol: Before and After Imputation",
x = "Cholestrol", y = "Density") +
theme_minimal() +
scale_fill_manual(name = "Cholestrol", values = c("blue" = "blue", "red" = "red"),
labels = c("Before Imputation", "After Imputation"))

The distribution of cholesterol did not change significantly after
imputation. As well, the residual plot does not display any patterns. In
turn, the model fit the data well and the assumption of constant
variance of residuals is satisfied.
Feature Engineer
In turn a feature which calculates BMI is created to determine
observations are classified as obese. According to the CDC (https://www.cdc.gov/nccdphp/dnpao/data-trends-maps/help/npao_dtm/definitions.html),
Adult obesity is defined as body mass index (BMI) ≥ 30.0. Another
feature named obese, based on BMI, is created, where 0 = no and 1 =
yes.
# Create BMI feature
imputed_nhanes$BMI <- (imputed_nhanes$body.weight * 703) / (imputed_nhanes$height * imputed_nhanes$height)
# Round the BMI to the nearest ones place
imputed_nhanes$BMI <- round(imputed_nhanes$BMI, 0)
# Create the obese feature based on BMI
imputed_nhanes$obese <- ifelse(imputed_nhanes$BMI >= 30, 1, 0)
par(mfrow = c(1, 2))
# Histogram for BMI
hist(imputed_nhanes$BMI,
main = "Histogram of BMI",
xlab = "BMI",
col = "steelblue",
border = "black",
breaks = 20)
# bar chart for obese
nhanes$obese <- factor(imputed_nhanes$obese, levels = c(0, 1), labels = c("No", "Yes"))
obese_count <- table(imputed_nhanes$obese)
barplot(height = obese_count, col = "steelblue", main = "Obesity", xlab = "Obesity", ylab = "Count")

# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))
# Create a contingency table
obese_hbp_table <- table(imputed_nhanes$obese, imputed_nhanes$hbp)
# Generate mosaic plot
mosaicplot(obese_hbp_table,
col = c("gold", "royalblue"),
main = "Obesity and High Blood Pressure",
xlab = "Obesity Status",
ylab = "HBP Status",
border = "black")
# Create a contingency table
obese_sex_table <- table(imputed_nhanes$sex, imputed_nhanes$obese)
# Generate mosaic plot
mosaicplot(obese_sex_table,
col = c("gold", "royalblue"),
main = "Sex and Obesity",
xlab = "Obesity Status",
ylab = "Sex",
border = "black")
# Create a contingency table
obese_race_table <- table(imputed_nhanes$race, imputed_nhanes$obese)
# Generate mosaic plot
mosaicplot(obese_race_table,
col = c("gold", "royalblue"),
main = "Race and Obesity",
xlab = "Obesity Status",
ylab = "Race",
border = "black")
# Create a contingency table
obese_smoke_table <- table(imputed_nhanes$current.smk, imputed_nhanes$obese)
# Generate mosaic plot
mosaicplot(obese_smoke_table,
col = c("gold", "royalblue"),
main = "Current Smoking Status and Obesity",
xlab = "Obesity Status",
ylab = "Current Smoking Status",
border = "black")

Multiple
Imputation
Multiple Imputation by Chained Equations (MICE) is a powerful
technique used to handle missing data by creating multiple plausible
imputations for each missing value based on relationships in the
observed data. For body weight, cholesterol, and race, MICE can be
applied to generate accurate imputations, accounting for correlations
between these variables and other features in the dataset. This method
allows for more reliable statistical analyses by reducing bias and
preserving variability in the imputed values, making it especially
useful for handling missing data in incoming new data sets.
nhanes1$race <- as.factor(nhanes1$race)
nhanes1$cholestrol <- as.numeric(nhanes1$cholestrol)
nhanes1$body.weight <- as.numeric(nhanes1$body.weight)
nhanes2 <- nhanes1
# 1 initialization
init <- mice(nhanes2, maxit=0)
print(init$method)
X obs psu stratum stat.weight age
"" "" "" "" "" ""
sex race body.weight height avg.sbp avg.dbp
"" "polyreg" "pmm" "" "" ""
past.smk current.smk cholestrol hbp
"" "" "pmm" ""
# 2 imputation models
imp <- mice(nhanes2, method = c("", "", "", "", "", "", "", "polyreg", "pmm", "", "", "", "", "", "pmm", ""),
maxit=10,
m=1,
seed = 123,
print=F)
complete(imp, action = 1)[1:10,]
X obs psu stratum stat.weight age sex race body.weight height avg.sbp
1 3 9 2 43 19451.83 48 0 1 149.7 61.8 131
2 5 11 2 40 1245.52 48 1 1 155.3 66.2 120
3 6 19 1 35 3860.97 44 1 2 189.6 70.2 133
4 7 34 1 13 5031.70 42 0 2 125.8 62.6 100
5 10 48 1 24 26919.29 56 0 1 239.9 67.6 128
6 12 51 1 28 2730.56 44 1 1 317.9 71.1 130
7 15 55 2 44 804.03 48 1 1 245.6 68.0 155
8 19 70 1 9 2629.39 63 1 2 202.9 67.8 137
9 20 71 1 43 1147.32 37 0 1 151.7 61.7 128
10 22 73 1 29 3991.81 42 1 3 205.0 68.4 148
avg.dbp past.smk current.smk cholestrol hbp
1 73 1 2 236 0
2 70 1 2 260 0
3 85 1 1 187 0
4 67 1 1 216 0
5 73 1 2 156 0
6 86 1 1 162 0
7 91 1 1 212 1
8 68 1 1 186 0
9 70 1 1 209 0
10 83 1 1 267 1
# 3 multiple (iterative) imputations
imp5 <- mice(nhanes2, method = "pmm", m = 5, maxit = 10, seed = 123, print=F)
plot(imp5)

Multiple Imputation by Chained Equations (MICE) will be applied to
handle missing values in the body weight variable. To assess the quality
of the imputations, the Mean Squared Error (MSE) will be calculated and
compared the MSE of the single imputation conducted for body weight
earlier in this analysis, providing a measure of the imputation
accuracy.
model5_bodyweight <- with(imp5, lm(body.weight ~ height + avg.dbp + sex))
summary.stats = summary(model5_bodyweight)
summary(pool(model5_bodyweight))
term estimate std.error statistic df p.value
1 (Intercept) -205.7045484 8.74935482 -23.510825 2473.3145 1.747163e-110
2 height 4.8466598 0.13536721 35.803796 1634.9416 8.934110e-208
3 avg.dbp 0.7523449 0.03941721 19.086712 157.8342 7.501512e-43
4 sex -6.6078962 1.06330598 -6.214482 276.0479 1.892089e-09
beta = summary.stats$estimate[seq(1,15,by=3)] # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)
pool.se.intercept
[1,] 123.2177
The model can be used to predict body weight for incoming new
data.
The MSE for single body weight imputation used earlier in this
analysis is MSE = (RSE)^2 = (33.52)^2 = 1123.6 which is substantially
higher than MSE when using MICE. In turn, MICE is recommended in this
analysis.
Multiple Imputation by Chained Equations (MICE) will be applied to
handle missing values in the cholesterol variable. To assess the quality
of the imputations, the Mean Squared Error (MSE) will be calculated and
compared the MSE of the single imputation conducted for cholesterol
earlier in this analysis, providing a measure of the imputation
accuracy.
model5_chol <- with(imp5, lm(cholestrol ~ age + avg.sbp))
summary.stats = summary(model5_chol)
summary(pool(model5_chol))
term estimate std.error statistic df p.value
1 (Intercept) 148.6912121 3.65965489 40.629845 74.24158 1.827860e-52
2 age 0.5233278 0.03407744 15.357017 236.74859 2.129193e-37
3 avg.sbp 0.2585339 0.03425598 7.547118 63.62530 2.094322e-10
beta = summary.stats$estimate[seq(1,15,by=3)] # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)
pool.se.intercept
[1,] 3.659655
The model can be used to predict cholesterol for incoming new
data.
The Mean Squared Error (MSE) from the Residual Standard Error (RSE)
given in the model output from the cholesterol single imputation method
used early in this analysis is MSE = (42.59)^2 = 1813.9. Thus, the Mean
Squared Error (MSE) = 1813.9.
The MSE using MICE is 3.66 which is substantially less. Therefore,
MICE is recommended in the analysis of this data set.
For the missing values in the race variable, Multiple Imputation by
Chained Equations (MICE) will be applied using multinomial logistic
regression. This method will account for the categorical nature of the
race variable, allowing for appropriate imputation based on the
relationships with other features in the dataset.
model_race <- with(imp5, multinom(race ~ avg.dbp + avg.sbp + cholestrol + hbp))
# weights: 18 (10 variable)
initial value 8708.699612
iter 10 value 5523.235718
final value 5495.930272
converged
# weights: 18 (10 variable)
initial value 8708.699612
iter 10 value 5555.595871
final value 5522.606318
converged
# weights: 18 (10 variable)
initial value 8708.699612
iter 10 value 5555.946819
final value 5529.649391
converged
# weights: 18 (10 variable)
initial value 8708.699612
iter 10 value 5560.173296
final value 5529.439901
converged
# weights: 18 (10 variable)
initial value 8708.699612
iter 10 value 5542.550592
final value 5510.127472
converged
# Pool the results from the imputed datasets
pooled_race <- pool(model_race)
# Get a summary of the pooled results
summary(pooled_race)
term estimate std.error statistic df p.value
1 (Intercept) -1.277932787 0.3325448882 -3.8428881 49.56081 3.468517e-04
2 avg.dbp 0.030728414 0.0030785469 9.9814670 343.89941 9.007490e-21
3 avg.sbp -0.009072042 0.0025775079 -3.5196952 252.21602 5.124609e-04
4 cholestrol -0.004119817 0.0007738102 -5.3240668 28.44201 1.091924e-05
5 hbp 0.148914062 0.1127433356 1.3208236 101.89054 1.895191e-01
6 (Intercept) -2.575346290 0.9149529588 -2.8147308 40.55536 7.498473e-03
7 avg.dbp 0.019877690 0.0080743228 2.4618399 3279.39165 1.387374e-02
8 avg.sbp -0.007708776 0.0071499482 -1.0781583 208.24247 2.822104e-01
9 cholestrol -0.005554712 0.0020814684 -2.6686508 36.04555 1.134283e-02
10 hbp -0.151093362 0.3197542727 -0.4725296 109.96948 6.374852e-01
The model can be used to predict race for incoming new data.
FEATURE SELECTION
Feature selection is a critical step in model building, as it helps
improve model performance by identifying the most relevant predictors
and reducing overfitting. In this analysis, both model-based selection
and regularization-based selection techniques will be used to assess and
retain the most influential features for accurate prediction, ensuring a
more efficient and interpretable model.
A full model is created to predict cholesterol.
#sapply(imputed_nhanes, function(x) length(unique(x)))
#head(imputed_nhanes, 25)
#head(imputed_nhanes)
model_full <- lm(cholestrol ~ age + race + BMI + height + body.weight +
hbp + avg.dbp + avg.sbp + obese,
data = imputed_nhanes)
# Summary of the full model
#summary(model_full)
# Tidy the model summary
model_summary <- tidy(model_full)
# Calculate Mean Squared Error (MSE)
mse <- mean(model_full$residuals^2)
# Calculate R-squared
r2 <- summary(model_full)$r.squared
# Create a data frame for MSE and R-squared
metrics_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"),
Value = c(mse, r2))
# Print the model summary using kable
kable(model_summary, digits = 3, caption = "Regression Model Summary")
Regression Model Summary
| (Intercept) |
0.043 |
0.017 |
2.447 |
0.014 |
| age |
0.250 |
0.013 |
18.847 |
0.000 |
| raceBlack |
-0.068 |
0.024 |
-2.781 |
0.005 |
| raceOther |
-0.206 |
0.067 |
-3.058 |
0.002 |
| BMI |
-0.122 |
0.082 |
-1.477 |
0.140 |
| height |
-0.190 |
0.049 |
-3.862 |
0.000 |
| body.weight |
0.266 |
0.093 |
2.847 |
0.004 |
| hbp |
-0.066 |
0.042 |
-1.574 |
0.116 |
| avg.dbp |
0.116 |
0.013 |
8.774 |
0.000 |
| avg.sbp |
0.054 |
0.021 |
2.597 |
0.009 |
| obese |
-0.022 |
0.038 |
-0.589 |
0.556 |
# Print MSE and R-squared using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics")
Model Performance Metrics
| Mean Squared Error |
0.876 |
| R-squared |
0.124 |
Stepwise model selection is conducted to retain significant
predictors. Selection processes based on AIC and BIC are conducted.
# Stepwise model selection based on AIC
model_stepwise_AIC <- step(model_full, direction = "both", trace = 0)
# Stepwise model selection based on BIC
model_stepwise_BIC <- step(model_full, direction = "both", k = log(nrow(imputed_nhanes)), trace = 0)
# Tidy the summaries for both models
stepwise_AIC_summary <- tidy(model_stepwise_AIC)
stepwise_BIC_summary <- tidy(model_stepwise_BIC)
# Compute MSE for both models
mse_AIC <- mean(model_stepwise_AIC$residuals^2)
mse_BIC <- mean(model_stepwise_BIC$residuals^2)
# Compute R-squared for both models
r2_AIC <- summary(model_stepwise_AIC)$r.squared
r2_BIC <- summary(model_stepwise_BIC)$r.squared
# Create data frames for MSE and R-squared
metrics_AIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"),
Value = c(mse_AIC, r2_AIC))
metrics_BIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"),
Value = c(mse_BIC, r2_BIC))
# Print the stepwise AIC model summary
kable(stepwise_AIC_summary, digits = 3, caption = "Stepwise AIC Regression Model Summary")
Stepwise AIC Regression Model Summary
| (Intercept) |
0.038 |
0.015 |
2.462 |
0.014 |
| age |
0.250 |
0.013 |
18.875 |
0.000 |
| raceBlack |
-0.068 |
0.024 |
-2.792 |
0.005 |
| raceOther |
-0.206 |
0.067 |
-3.062 |
0.002 |
| BMI |
-0.127 |
0.082 |
-1.553 |
0.121 |
| height |
-0.188 |
0.049 |
-3.842 |
0.000 |
| body.weight |
0.264 |
0.093 |
2.829 |
0.005 |
| hbp |
-0.067 |
0.042 |
-1.588 |
0.112 |
| avg.dbp |
0.117 |
0.013 |
8.778 |
0.000 |
| avg.sbp |
0.054 |
0.021 |
2.604 |
0.009 |
# Print performance metrics for AIC model
kable(metrics_AIC_df, digits = 3, caption = "Model Performance Metrics (AIC)")
Model Performance Metrics (AIC)
| Mean Squared Error |
0.876 |
| R-squared |
0.124 |
# Print the stepwise BIC model summary
kable(stepwise_BIC_summary, digits = 3, caption = "Stepwise BIC Regression Model Summary")
Stepwise BIC Regression Model Summary
| (Intercept) |
0.000 |
0.011 |
0.000 |
1 |
| age |
0.271 |
0.011 |
25.433 |
0 |
| height |
-0.116 |
0.012 |
-9.640 |
0 |
| body.weight |
0.122 |
0.012 |
9.908 |
0 |
| avg.dbp |
0.129 |
0.011 |
11.519 |
0 |
# Print performance metrics for BIC model
kable(metrics_BIC_df, digits = 3, caption = "Model Performance Metrics (BIC)")
Model Performance Metrics (BIC)
| Mean Squared Error |
0.879 |
| R-squared |
0.121 |
# Check R-squared value
summary(model_stepwise_AIC)$r.squared
[1] 0.123777
summary(model_stepwise_BIC)$r.squared
[1] 0.1210105
Both AIC and BIC stepwise model selection processes have similar R
squared values. The BIC stepwise model is parsimonious. This model
selects age, height, body weight, and average diastolic blood pressure
as predictors for cholesterol.
Regularization-based selection techniques are useful for improving
model performance and preventing overfitting, especially when you have
many predictors. Two common regularization methods are Ridge regression
and Lasso regression.
# Extract the response and predictors
y <- imputed_nhanes$cholestrol # Response variable
X <- as.matrix(imputed_nhanes[, c("age", "BMI", "height", "body.weight", "avg.dbp", "avg.sbp")])
# Optional: Scale predictors (recommended for Ridge and Lasso)
X_scaled <- scale(X)
# Fit Ridge and Lasso models
ridge_model <- glmnet(X_scaled, y, alpha = 0) # Ridge Regression (alpha = 0)
lasso_model <- glmnet(X_scaled, y, alpha = 1) # Lasso Regression (alpha = 1)
# Ridge cross-validation
cv_ridge <- cv.glmnet(X_scaled, y, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min
# Lasso cross-validation
cv_lasso <- cv.glmnet(X_scaled, y, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min
# Extract coefficients for Ridge and Lasso at optimal lambda
ridge_coefficients <- as.data.frame(as.matrix(coef(cv_ridge, s = "lambda.min")))
colnames(ridge_coefficients) <- c("Coefficient")
ridge_coefficients$Predictor <- rownames(ridge_coefficients)
lasso_coefficients <- as.data.frame(as.matrix(coef(cv_lasso, s = "lambda.min")))
colnames(lasso_coefficients) <- c("Coefficient")
lasso_coefficients$Predictor <- rownames(lasso_coefficients)
# Predictions for Ridge
pred_ridge <- predict(cv_ridge, X_scaled, s = "lambda.min")
# Predictions for Lasso
pred_lasso <- predict(cv_lasso, X_scaled, s = "lambda.min")
# Calculate RMSE for Ridge
rmse_ridge <- sqrt(mean((pred_ridge - y)^2))
# Calculate RMSE for Lasso
rmse_lasso <- sqrt(mean((pred_lasso - y)^2))
# Calculate R² for Ridge
rss_ridge <- sum((pred_ridge - y)^2) # Residual Sum of Squares
tss_ridge <- sum((y - mean(y))^2) # Total Sum of Squares
r2_ridge <- 1 - rss_ridge / tss_ridge
# Calculate R² for Lasso
rss_lasso <- sum((pred_lasso - y)^2)
tss_lasso <- sum((y - mean(y))^2)
r2_lasso <- 1 - rss_lasso / tss_lasso
# Create a data frame for RMSE and R²
metrics_df <- data.frame(
Model = c("Ridge", "Lasso"),
RMSE = c(rmse_ridge, rmse_lasso),
R2 = c(r2_ridge, r2_lasso)
)
# Print Ridge coefficients using kable
kable(ridge_coefficients, digits = 3, caption = "Ridge Regression Coefficients")
Ridge Regression Coefficients
| (Intercept) |
0.000 |
(Intercept) |
| age |
0.245 |
age |
| BMI |
0.041 |
BMI |
| height |
-0.086 |
height |
| body.weight |
0.071 |
body.weight |
| avg.dbp |
0.109 |
avg.dbp |
| avg.sbp |
0.039 |
avg.sbp |
# Print Lasso coefficients using kable
kable(lasso_coefficients, digits = 3, caption = "Lasso Regression Coefficients")
Lasso Regression Coefficients
| (Intercept) |
0.000 |
(Intercept) |
| age |
0.255 |
age |
| BMI |
0.000 |
BMI |
| height |
-0.113 |
height |
| body.weight |
0.120 |
body.weight |
| avg.dbp |
0.114 |
avg.dbp |
| avg.sbp |
0.030 |
avg.sbp |
# Print RMSE and R² values using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics (RMSE & R²)")
Model Performance Metrics (RMSE & R²)
| Ridge |
0.937 |
0.121 |
| Lasso |
0.937 |
0.121 |
The Ridge and Lasso method both had similar R squared values.
Therefore, both models are recommended.
LINEAR REGRESSION
In this regression analysis of the NHANES dataset, three models are
developed using cross validation to predict the outcome variable
cholesterol: a full model including all selected predictors, a stepwise
BIC model for variable selection, and a Ridge regression model to
address multicollinearity. Each model was evaluated based on its
predictive performance, with Mean Squared Error (MSE) and R-squared
(R^2) used as key comparison metrics. By analyzing these models, the aim
is to determine the most accurate and efficient approach for modeling
the data while balancing complexity and predictive power.
set.seed(42)
train_control <- trainControl(method = "cv", number = 10)
Full Model
The full model includes all selected predictor variables, providing a
comprehensive approach to estimating cholesterol levels. To assess its
stability, we implement k-fold cross-validation, which systematically
partitions the dataset into multiple training and validation subsets.
This approach mitigates the risk of overfitting by testing the model’s
performance on unseen data, yielding a robust estimate of prediction
error. The model’s effectiveness is quantified using the Mean Squared
Error (MSE), which measures the average squared difference between
observed and predicted cholesterol values. By employing
cross-validation, we ensure that the full model is evaluated rigorously,
providing insights into its accuracy and reliability in real-world
applications.
# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10) # 10-fold CV
# Fit the full model
full_model <- train(cholestrol ~ age + race + log_BMI + height + log_body.weight +
hbp + avg.dbp + log_avg.sbp + obese,
data = imputed_nhanes,
method = "lm",
trControl = train_control)
# Extract coefficients from the final model
coeff_table <- summary(full_model$finalModel)$coefficients %>%
as.data.frame() %>%
rownames_to_column(var = "Predictor") # Convert row names to a column
# Calculate Cross-Validation MSE
mse <- mean(full_model$resample$RMSE^2)
# Calculate R² from the final model
r2 <- summary(full_model$finalModel)$r.squared
# Create a data frame for MSE and R²
metrics_df <- data.frame(
Statistic = c("Mean Squared Error", "R-squared"),
Value = c(mse, r2)
)
# Print the coefficients using kable
kable(coeff_table, digits = 4, caption = "Regression Model Coefficients", format = "markdown")
Regression Model Coefficients
| (Intercept) |
-4.5758 |
1.7615 |
-2.5977 |
0.0094 |
| age |
0.2434 |
0.0134 |
18.1921 |
0.0000 |
| raceBlack |
-0.0679 |
0.0243 |
-2.7969 |
0.0052 |
| raceOther |
-0.2054 |
0.0671 |
-3.0602 |
0.0022 |
| log_BMI |
0.6340 |
0.8695 |
0.7292 |
0.4659 |
| height |
-0.0660 |
0.0975 |
-0.6772 |
0.4983 |
| log_body.weight |
0.1011 |
0.8701 |
0.1162 |
0.9075 |
| hbp |
-0.0668 |
0.0403 |
-1.6602 |
0.0969 |
| avg.dbp |
0.1091 |
0.0134 |
8.1549 |
0.0000 |
| log_avg.sbp |
0.4214 |
0.1375 |
3.0636 |
0.0022 |
| obese |
-0.0581 |
0.0367 |
-1.5830 |
0.1135 |
# Print MSE and R² using kable
kable(metrics_df, digits = 4, caption = "Model Performance Metrics (MSE & R²)", format = "markdown")
Model Performance Metrics (MSE & R²)
| Mean Squared Error |
0.8762 |
| R-squared |
0.1266 |
Stepwise BIC
In this analysis, a Stepwise BIC regression model is implemented to
identify the most parsimonious subset of predictors for estimating
cholesterol levels. The Stepwise BIC approach selects variables by
minimizing the Bayesian Information Criterion (BIC), which penalizes
model complexity to prevent overfitting. To assess the model’s
generalizability, we employ cross-validation by splitting the dataset
into training and testing subsets. The model is trained on the training
data, and predictions are made on the test data to compute the Mean
Squared Error (MSE) as a measure of model performance. Additionally, the
estimated regression coefficients in a structured table for clarity and
interpretation. This approach ensures an optimal balance between model
complexity and predictive accuracy while validating the model’s
robustness.
stepwise_bic_model <- function(train_data, test_data) {
# Fit initial full model
full_model2 <- lm(cholestrol ~ age + height + log_body.weight + avg.dbp, data = train_data)
# Perform stepwise selection using BIC
model_stepwise_BIC <- stepAIC(full_model2, direction = "both", k = log(nrow(train_data)), trace = FALSE)
# Extract coefficients
coeff_table <- summary(model_stepwise_BIC)$coefficients %>%
as.data.frame() %>%
rownames_to_column(var = "Predictor") # Convert row names to a column
# Print coefficients using kable
print(kable(coeff_table, digits = 4, caption = "Stepwise BIC Model Coefficients", format = "markdown"))
# Predict on test data
predictions <- predict(model_stepwise_BIC, newdata = test_data)
# Compute Mean Squared Error (MSE)
mse <- mean((test_data$cholestrol - predictions)^2)
# Compute R²
rss <- sum((test_data$cholestrol - predictions)^2) # Residual Sum of Squares
tss <- sum((test_data$cholestrol - mean(test_data$cholestrol))^2) # Total Sum of Squares
r2 <- 1 - rss/tss # R-squared
# Create a data frame for MSE and R²
metrics_df <- data.frame(
Statistic = c("Mean Squared Error", "R-squared"),
Value = c(mse, r2)
)
# Print MSE and R² using kable
print(kable(metrics_df, digits = 4, caption = "Stepwise BIC Model Performance Metrics (MSE & R²)", format = "markdown"))
return(list(model = model_stepwise_BIC, mse = mse, r2 = r2)) # Return the model, MSE, and R²
}
# Example usage (Assuming you have a dataset split into train_data and test_data)
set.seed(123)
train_index <- createDataPartition(imputed_nhanes$cholestrol, p = 0.8, list = FALSE)
train_data <- imputed_nhanes[train_index, ]
test_data <- imputed_nhanes[-train_index, ]
# Run the function
stepwise_results <- stepwise_bic_model(train_data, test_data)
Table: Stepwise BIC Model Coefficients
|Predictor | Estimate| Std. Error| t value| Pr(>|t|)|
|:---------------|--------:|----------:|-------:|------------------:|
|(Intercept) | -3.3622| 0.3416| -9.8427| 0|
|age | 0.2671| 0.0120| 22.2994| 0|
|height | -0.1274| 0.0138| -9.2161| 0|
|log_body.weight | 0.6589| 0.0669| 9.8520| 0|
|avg.dbp | 0.1191| 0.0125| 9.5632| 0|
Table: Stepwise BIC Model Performance Metrics (MSE & R²)
|Statistic | Value|
|:------------------|------:|
|Mean Squared Error | 0.8383|
|R-squared | 0.1425|
Ridge Regression
Ridge regression is applied to the NHANES dataset to address
multicollinearity among predictor variables while maintaining model
stability. By imposing a penalty on large coefficients through L2
regularization, Ridge regression prevents overfitting and improves
generalization. The model was evaluated using cross-validation, and its
performance was compared to other regression approaches based on Mean
Squared Error (MSE) and R².
# Define predictor matrix (X) and response variable (y)
X <- as.matrix(imputed_nhanes[, c("age", "log_BMI", "height", "log_body.weight", "avg.dbp", "log_avg.sbp")])
y <- imputed_nhanes$cholestrol
set.seed(123) # For reproducibility
# Define training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# Train Ridge Regression Model
ridge_model <- train(
x = X, y = y,
method = "glmnet",
trControl = train_control,
tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(-3, 3, length = 100)) # Ridge: alpha = 0
)
# Best lambda value selected via cross-validation
best_lambda <- ridge_model$bestTune$lambda
# Predict using best lambda
ridge_predictions <- predict(ridge_model, X)
# Compute Mean Squared Error (MSE)
ridge_mse <- mean((y - ridge_predictions)^2)
# Compute R²
rss <- sum((y - ridge_predictions)^2) # Residual Sum of Squares
tss <- sum((y - mean(y))^2) # Total Sum of Squares
ridge_r2 <- 1 - (rss / tss) # R² calculation
# Extract coefficients from the final Ridge model
coefficients <- as.data.frame(as.matrix(ridge_model$finalModel$beta)) %>%
tibble::rownames_to_column(var = "Predictor") # Convert row names to a column
#print(kable(head(coefficients, 10), digits = 4, caption = "Top 10 Ridge Regression Coefficients", format = "markdown"))
# Create a data frame for MSE, R², and Best Lambda
metrics_df <- data.frame(
Statistic = c("Mean Squared Error", "R-squared", "Optimal Lambda"),
Value = c(ridge_mse, ridge_r2, best_lambda)
)
# Print model performance metrics using kable()
print(kable(metrics_df, digits = 4, caption = "Ridge Regression Performance Metrics", format = "markdown"))
Table: Ridge Regression Performance Metrics
|Statistic | Value|
|:------------------|------:|
|Mean Squared Error | 0.8757|
|R-squared | 0.1242|
|Optimal Lambda | 0.0285|
Comparison
After performing cross-validation on the NHANES_imputed dataset, the
predictive performance of the thre models are compared. The BIC model
has the smallest MSE and highest R^2 value, indicating a better
performing regression model. In, turn, the BIC model is recommended.
LOGISTIC
REGRESSION
This analysis applies logistic regression to predict current smoking
status using data from NHANES. Four models are developed: a full model
including all relevant predictors, a stepwise AIC model for variable
selection, a BIC model for a more parsimonious approach, and a
regularized logistic regression model using Lasso to prevent
overfitting. The models are evaluated and compared using Receiver
Operating Characteristic (ROC) curves and the Area Under the Curve (AUC)
to assess their classification performance and predictive accuracy.
Full Model
The first candidate, a full model, is fitted to predict current
smoking status.
# Fit logistic regression model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese +
height + log_avg.sbp + avg.dbp + log_cholestrol + hbp,
data = imputed_nhanes,
family = binomial)
# Extract coefficients from the model
coefficients <- summary(full_model)$coefficients
# Convert coefficients to a data frame and add row names as a column
coeff_table <- as.data.frame(coefficients) %>%
tibble::rownames_to_column(var = "Predictor") # Convert row names to a column
# Print the coefficients table using kable
kable(coeff_table, digits = 4, caption = "Logistic Regression Model Coefficients", format = "markdown")
Logistic Regression Model Coefficients
| (Intercept) |
-9.6100 |
4.2576 |
-2.2571 |
0.0240 |
| age |
0.8518 |
0.0341 |
24.9430 |
0.0000 |
| raceBlack |
-0.8473 |
0.0583 |
-14.5294 |
0.0000 |
| raceOther |
-0.3106 |
0.1567 |
-1.9825 |
0.0474 |
| log_body.weight |
2.6402 |
2.0813 |
1.2685 |
0.2046 |
| log_BMI |
-0.6406 |
2.0792 |
-0.3081 |
0.7580 |
| obese |
-0.1208 |
0.0867 |
-1.3930 |
0.1636 |
| height |
-0.2097 |
0.2330 |
-0.8999 |
0.3682 |
| log_avg.sbp |
-0.4720 |
0.3284 |
-1.4375 |
0.1506 |
| avg.dbp |
0.0655 |
0.0324 |
2.0189 |
0.0435 |
| log_cholestrol |
0.1397 |
0.1348 |
1.0365 |
0.3000 |
| hbp |
-0.0763 |
0.0957 |
-0.7975 |
0.4252 |
Age, race, and average dbp are significant features in the full
model.
Stepwise AIC
Model
The second model, stepwise AIC, is fitted to predict current smoking
status. This model automatically selects a subset of predictors using
AIC-based stepwise selection.
# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese +
height + log_avg.sbp + avg.dbp + log_cholestrol + hbp,
data = imputed_nhanes,
family = binomial)
# Perform stepwise selection using AIC
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)
# Extract coefficients from the stepwise model
coefficients_stepwise <- summary(stepwise_model)$coefficients
# Convert coefficients to a data frame and add row names as a column
coeff_table_stepwise <- as.data.frame(coefficients_stepwise) %>%
tibble::rownames_to_column(var = "Predictor") # Convert row names to a column
# Print the coefficients table using kable
kable(coeff_table_stepwise, digits = 4, caption = "Stepwise Logistic Regression Model Coefficients", format = "markdown")
Stepwise Logistic Regression Model Coefficients
| (Intercept) |
-7.1104 |
1.5749 |
-4.5148 |
0.0000 |
| age |
0.8583 |
0.0335 |
25.6466 |
0.0000 |
| raceBlack |
-0.8515 |
0.0582 |
-14.6232 |
0.0000 |
| raceOther |
-0.3164 |
0.1565 |
-2.0212 |
0.0433 |
| log_body.weight |
2.0364 |
0.2072 |
9.8304 |
0.0000 |
| obese |
-0.1260 |
0.0865 |
-1.4575 |
0.1450 |
| height |
-0.1417 |
0.0341 |
-4.1583 |
0.0000 |
| log_avg.sbp |
-0.6344 |
0.2525 |
-2.5125 |
0.0120 |
| avg.dbp |
0.0698 |
0.0323 |
2.1632 |
0.0305 |
All features, other than obese, are significant in the stepwise AIC
model.
BIC Model
The third candidate,a BIC model, is fitted to predict current smoking
status. This model fits a more parsimonious model.
# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese +
height + log_avg.sbp + avg.dbp + log_cholestrol + hbp,
data = imputed_nhanes,
family = binomial)
# Perform stepwise selection using BIC (using k = log(nrow(imputed_nhanes)))
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)
# Extract coefficients from the BIC model
coefficients_bic <- summary(bic_model)$coefficients
# Convert coefficients to a data frame and add row names as a column
coeff_table_bic <- as.data.frame(coefficients_bic) %>%
tibble::rownames_to_column(var = "Predictor") # Convert row names to a column
# Print the coefficients table using kable
kable(coeff_table_bic, digits = 4, caption = "BIC Stepwise Model Coefficients", format = "markdown")
BIC Stepwise Model Coefficients
| (Intercept) |
-9.1460 |
0.7115 |
-12.8541 |
0.0000 |
| age |
0.8140 |
0.0268 |
30.3780 |
0.0000 |
| raceBlack |
-0.8552 |
0.0579 |
-14.7775 |
0.0000 |
| raceOther |
-0.3123 |
0.1562 |
-1.9988 |
0.0456 |
| log_body.weight |
1.8297 |
0.1394 |
13.1290 |
0.0000 |
| height |
-0.1092 |
0.0290 |
-3.7669 |
0.0002 |
All features are significant in the BIC model.
Regularized Logistic
Regression Model (Lasso)
The fourth candidate model uses Regularized Logistic Regression
(Lasso). This method selects the most important predictors.
# Define predictor matrix (X) and response variable (y)
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp,
data = imputed_nhanes)[, -1] # Exclude intercept
y <- imputed_nhanes$current.smk
# Train the Lasso model using cross-validation
lasso_model <- cv.glmnet(x, y, family = "binomial", alpha = 1)
# Best lambda value selected via cross-validation
best_lambda_lasso <- lasso_model$lambda.min
print(paste("Optimal Lambda for Lasso:", best_lambda_lasso))
[1] "Optimal Lambda for Lasso: 0.000783042982863974"
# Extract coefficients from the final Lasso model at the optimal lambda
lasso_coefficients <- coef(lasso_model, s = "lambda.min")
# Convert coefficients to a data frame
lasso_coefficients_df <- as.data.frame(as.matrix(lasso_coefficients))
lasso_coefficients_df$Predictor <- rownames(lasso_coefficients_df)
# Reorder columns to make predictor names the first column
lasso_coefficients_df <- lasso_coefficients_df %>%
select(Predictor, everything())
# Print the coefficients table using kable
kable(lasso_coefficients_df, digits = 4, caption = "Lasso Regression Coefficients", format = "markdown")
Lasso Regression Coefficients
| (Intercept) |
(Intercept) |
-6.4108 |
| age |
age |
0.8393 |
| raceBlack |
raceBlack |
-0.8379 |
| raceOther |
raceOther |
-0.2843 |
| log_body.weight |
log_body.weight |
0.7797 |
| log_BMI |
log_BMI |
1.1279 |
| obese |
obese |
-0.0866 |
| height |
height |
0.0000 |
| log_avg.sbp |
log_avg.sbp |
-0.3598 |
| avg.dbp |
avg.dbp |
0.0546 |
| log_cholestrol |
log_cholestrol |
0.1316 |
| hbp |
hbp |
-0.0746 |
Comparison
In this analysis, four logistic regression models—Full Model,
Stepwise AIC Model, BIC Model, and Lasso Model are compared using ROC
(Receiver Operating Characteristic) curves and AUC (Area Under the
Curve) to assess their predictive performance. The optimal cutoff for
each model using Youden’s Index is determined, which maximizes the
balance between sensitivity and specificity. By evaluating these models
through both graphical and numerical metrics, insights is gained into
which model provides the best classification performance for predicting
the outcome variable.
# Full Model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, data = imputed_nhanes, family = binomial)
# Stepwise AIC Model
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)
# BIC Model
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)
# Lasso Model
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp , data = imputed_nhanes)[, -1] # Exclude intercept
y <- imputed_nhanes$current.smk
lasso_cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)
lasso_model <- glmnet(x, y, family = "binomial", alpha = 1, lambda = lasso_cv$lambda.min)
# Get predicted probabilities
imputed_nhanes$full_pred <- predict(full_model, type = "response")
imputed_nhanes$stepwise_pred <- predict(stepwise_model, type = "response")
imputed_nhanes$bic_pred <- predict(bic_model, type = "response")
imputed_nhanes$lasso_pred <- predict(lasso_model, newx = x, type = "response")[,1]
# Create ROC curves
full_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$full_pred)
stepwise_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$stepwise_pred)
bic_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$bic_pred)
lasso_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$lasso_pred)
# Calculate optimal cutoff using Youden's Index
optimal_cutoff_full <- full_roc$specificities[which.max(full_roc$sensitivities + full_roc$specificities - 1)]
optimal_cutoff_stepwise <- stepwise_roc$specificities[which.max(stepwise_roc$sensitivities + stepwise_roc$specificities - 1)]
optimal_cutoff_bic <- bic_roc$specificities[which.max(bic_roc$sensitivities + bic_roc$specificities - 1)]
optimal_cutoff_lasso <- lasso_roc$specificities[which.max(lasso_roc$sensitivities + lasso_roc$specificities - 1)]
# Print optimal cutoffs
cat("Optimal cutoff for Full Model:", optimal_cutoff_full, "\n")
Optimal cutoff for Full Model: 0.714215
cat("Optimal cutoff for Stepwise AIC Model:", optimal_cutoff_stepwise, "\n")
Optimal cutoff for Stepwise AIC Model: 0.7191679
cat("Optimal cutoff for BIC Model:", optimal_cutoff_bic, "\n")
Optimal cutoff for BIC Model: 0.7209014
cat("Optimal cutoff for Lasso Model:", optimal_cutoff_lasso, "\n")
Optimal cutoff for Lasso Model: 0.7013373
# Plot all ROC curves
ggplot() +
geom_line(aes(x = full_roc$specificities, y = full_roc$sensitivities, color = "Full Model")) +
geom_line(aes(x = stepwise_roc$specificities, y = stepwise_roc$sensitivities, color = "Stepwise AIC")) +
geom_line(aes(x = bic_roc$specificities, y = bic_roc$sensitivities, color = "BIC Model")) +
geom_line(aes(x = lasso_roc$specificities, y = lasso_roc$sensitivities, color = "Lasso Model")) +
labs(title = "ROC Curve Comparison", x = "1 - Specificity", y = "Sensitivity") +
scale_color_manual(values = c("red", "blue", "green", "purple", "orange")) +
theme_minimal()

# AUC values
auc_values <- data.frame(
Model = c("Full Model", "Stepwise AIC", "BIC Model", "Lasso Model"),
AUC = c(auc(full_roc), auc(stepwise_roc), auc(bic_roc), auc(lasso_roc))
)
print(auc_values)
Model AUC
1 Full Model 0.7509937
2 Stepwise AIC 0.7507281
3 BIC Model 0.7496650
4 Lasso Model 0.7508550
All models have very similar AUC. In turn, the more parsimonious
model, BIC, is recommended.
---
title: 'Analysis of The National Health and Nutrition Examination Survey (NHANES)'
author: "Joanne Musa"
date: "March 5, 2025"
output: 
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
editor_options: 
  chunk_output_type: inline
---

```{=html}

<style type="text/css">

/* Cascading Style Sheets (CSS) is a stylesheet language used to describe the presentation of a document written in HTML or XML. it is a simple mechanism for adding style (e.g., fonts, colors, spacing) to Web documents. */

h1.title {  /* Title - font specifications of the report title */
  font-size: 24px;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
  font-weight: bold;
}
h4.author { /* Header 4 - font specifications for authors  */
  font-size: 20px;
  font-family: system-ui;
  color: DarkRed;
  text-align: center;
  font-weight: bold;
}
h4.date { /* Header 4 - font specifications for the date  */
  font-size: 18px;
  font-family: system-ui;
  color: DarkBlue;
  text-align: center;
}
h1 { /* Header 1 - font specifications for level 1 section title  */
    font-size: 22px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: center;
    font-weight: bold;
}
h2 { /* Header 2 - font specifications for level 2 section title */
    font-size: 20px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h3 { /* Header 3 - font specifications of level 3 section title  */
    font-size: 18px;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
    font-weight: bold;
}

h4 { /* Header 4 - font specifications of level 4 section title  */
    font-size: 16px;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
    font-weight: bold;
}

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

</style>
```

```{r setup, include=FALSE}
# code chunk specifies whether the R code, warnings, and output 
# will be included in the output files.

if (!require("knitr")) {
   install.packages("knitr")
   library(knitr)
}

if (!require("MASS")) {
   install.packages("MASS")
   library(MASS)
}
if (!require("leaflet")) {
   install.packages("leaflet")
   library(leaflet)
}

if (!require("ggridges")) {
   install.packages("ggridges")
library(ggridges)
}
if (!require("tidyverse")) {
   install.packages("tidyverse")
library(tidyverse)
}

if (!require("ggplot2")) {
   install.packages("ggplot2")
   library(ggplot2)
}
if (!require("nnet")) {
   install.packages("nnet")
   library(nnet)
}
 
if (!require("nortest")) {
   install.packages("nortest")
   library(nortest)
} 

if (!require("corrplot")) {
   install.packages("corrplot")
   library(corrplot)
}

if (!require("VIM")) {
   install.packages("VIM")
   library(VIM)
}


if (!require("mice")) {
   install.packages("mice")
   library(mice)
}

if (!require("naniar")) {
   install.packages("naniar")
   library(naniar)

}

if (!require("GGally")) {
   install.packages("GGally")
   library(GGally)

}

if (!require("glmnet")) {
   install.packages("glmnet")
   library(glmnet)

}

if (!require("caret")) {
   install.packages("caret")
   library(caret)

}

if (!require("pROC")) {
   install.packages("pROC")
   library(pROC)

}


knitr::opts_chunk$set(
                      echo = TRUE,   
                      warning = FALSE,  
                      result = TRUE,    
                      message = FALSE,
                      comment = NA
                      )  
```



# INTRODUCTION


```{r}
nhanes <- read.csv("https://jmusa3.github.io/jmusa/nhanes.csv")

#CREATE MISSING VALUES in some variables

# Set seed for reproducibility
set.seed(123)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.14 * nrow(nhanes))

# Randomly select row indices to introduce missing values 
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in body.weight to NA
nhanes$body.weight[missing_indices] <- NA

# Verify missing values
#sum(is.na(nhanes$body.weight))  


# Set seed for reproducibility
set.seed(456)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.15 * nrow(nhanes))

# Randomly select row indices to introduce missing values
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in body.weight to NA
nhanes$cholestrol[missing_indices] <- NA


# Set seed for reproducibility
set.seed(789)  

# Determine the number of missing values (15% of total rows)
num_missing <- round(0.15 * nrow(nhanes))

# Randomly select row indices to introduce missing values
missing_indices <- sample(1:nrow(nhanes), num_missing, replace = FALSE)

# Set selected rows in race to NA
nhanes$race[missing_indices] <- NA

# Verify missing values

#sum(is.na(nhanes$race))  
#sum(is.na(nhanes$cholestrol)) 
#sum(is.na(nhanes$body.weight)) 

# create a copy of nhanes for MICE
nhanes1 <- nhanes
nhanes1$race <- as.factor(nhanes1$race)
nhanes1$cholestrol <- as.numeric(nhanes1$cholestrol)



```


The National Health and Nutrition Examination Survey (NHANES) is a program conducted by the National Center for Health Statistics (NCHS), a division of the Centers for Disease Control and Prevention (CDC). NHANES has been conducted in various forms since the early 1960s, with the current continuous survey format beginning in 1999. The survey is designed to assess the health and nutritional status of adults and children in the United States through a combination of interviews, physical examinations, and laboratory tests. Data is collected from a nationally representative sample of the U.S. population using a complex, multistage probability sampling design. Participants undergo household interviews followed by health examinations at mobile examination centers (MECs), where trained medical professionals collect biological samples and conduct clinical assessments. The NHANES data provides critical insights into public health trends, helping policymakers, researchers, and healthcare professionals make informed decisions on national health priorities. The purpose of this study is to assess the health and nutritional status of the study participants with metrics such as high blood pressure. 

There are 15 features and 7927 observations in this data set. These features are listed below.

1) obs: Respondent Identification Number
2) psu: Pseudo-PSU 1,2 psu
3) stratum: Pseudo-stratum (01 - 49)
4) stat.weight: Statistical weight (225.93 - 139744.9)
5) age: Age (years) 
6) sex: 0 = Female, 1 = Male sex
7) race: Race (1 = White; 2 = Black; 3 = Other)
8) body.weight: Body Weight (pounds)
9) height: Standing Height (inches)
10) avg.sbp: Average Systolic BP (mm/Hg) 
11) avg.dbp: Average Diastolic BP (mm/Hg) 
12) past.smk: Has respondent smoked > 100 cigarettes in life (1 = Yes, 2 = No)
13) current.smk: Does respondent smoke cigarettes now? (1 = Yes, 2 = No)
14) cholestrol: Serum Cholesterol (mg/100ml) 
15) hbp: High Blood Pressure (0 if PEPMNK1R <= 140; 1 if PEPMNK1R > 140)

The obs, psu, stratum, and stat.weight variables are not included in this analysis because they do not directly provide health-related information. 

The data set has 317 missing values for body weight and 396 missing values for cholesterol. 

# EDA


## Distribution of Individual Features

The distributions of the individual features in the NHANES dataset provide a snapshot of the characteristics of the population in terms of various health and demographic variables. Understanding these distributions is essential for identifying patterns, detecting potential biases, and assessing the range and variability of each feature. This analysis helps in understanding how variables such as age, body weight, blood pressure, and lifestyle factors are spread across the dataset, which is crucial for subsequent statistical modeling and interpretation.

```{r}

# Convert variables to factors with proper labels
nhanes$sex <- factor(nhanes$sex, levels = c(0, 1), labels = c("Female", "Male"))
nhanes$race <- factor(nhanes$race, levels = c(1, 2, 3), labels = c("White", "Black", "Other"))
nhanes$past.smk <- factor(nhanes$past.smk, levels = c(1, 2), labels = c("Yes", "No"))
nhanes$current.smk <- factor(nhanes$current.smk, levels = c(1, 2), labels = c("Yes", "No"))

# Create frequency tables
sex_count <- table(nhanes$sex)

race_count <- table(nhanes$race)

past.smk_count <- table(nhanes$past.smk)

current.smk_count <- table(nhanes$current.smk)


par(mfrow = c(2, 2))

# Plot the bar charts
barplot(height = sex_count, col = "steelblue", main = "Sex", xlab = "Sex", ylab = "Count")

barplot(height = race_count, col = "steelblue", main = "Race", xlab = "Race", ylab = "Count")

barplot(height = past.smk_count, col = "steelblue", main = "Past Smoker", xlab = "Past Smoker", ylab = "Count")

barplot(height = current.smk_count, col = "steelblue", main = "Current Smoker", xlab = "Current Smoker", ylab = "Count")

```

There are slightly more male observations than female observation in this data set. The majority of participants are white, followed by black. All participants are recorded as being past smokers, which is likely an error. There is an imbalance for past smoker. In turn, past smoker cannot be used for analysis. About half of current participants are current smokers. 

```{r}

par(mfrow = c(2, 2))  

# Histogram for Cholesterol
hist(nhanes$cholestrol, 
     main = "Histogram of Cholesterol", 
     xlab = "Cholesterol", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)  # Control number of bins
    
     
# Histogram for Systolic BP
hist(nhanes$avg.sbp, 
     main = "Histogram of Systolic BP", 
     xlab = "Systolic BP", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
    

# Histogram for Body Weight
hist(nhanes$body.weight, 
     main = "Histogram of Body Weight", 
     xlab = "Body Weight", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
  
     
# Histogram for Height
hist(nhanes$body.weight, 
     main = "Histogram of Height", 
     xlab = "Height", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)   
     
```

Most participants do not have high blood pressure. The age of participants varies from about 20 to about 90. The height distribution is roughly symmetric. The body weight distribution is skewed right due to some high outliers. The high outliers may indicate obesity. 



```{r}

par(mfrow = c(2,2))

hist(nhanes$avg.sbp, col = "steelblue", 
     main = "Systolic BP",
     xlab = "Systolic BP", ylab = "Frequency",
     breaks = 30)

hist(nhanes$avg.dbp, col = "steelblue", 
     main = "Diastolic BP",
     xlab = "Diastolic BP", ylab = "Frequency",
     breaks = 30)

hist(nhanes$cholestrol, col = "steelblue", 
     main = "Cholesterol",
     xlab = "Cholesterol", ylab = "Frequency",
     breaks = 30)


```

Systolic BP and cholesterol have skewed right distributions. Diastolic BP is roughly symmetric. Of the 7927 participants, 1964 are obese. 


## Relationship between Features

Examining relationships between variables using scatterplots and a correlation matrix provides a comprehensive approach to understanding associations. Scatterplots visually reveal how two continuous variables interact, allowing us to identify trends, correlations, and potential outliers. Meanwhile, a correlation matrix quantifies the strength and direction of relationships between multiple variables at once, helping to pinpoint which pairs are strongly correlated. Together, scatterplots and a correlation matrix offer valuable insights into the underlying patterns in the data, guiding further statistical analysis and model selection.

```{r}
# Set seed for reproducibility
set.seed(123)

# Sample 500 rows (adjust the number as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]

# Create a pairwise scatterplot matrix with the sampled data
pairs(sampled_data, 
      main = "Pairwise Scatterplots (Bootstrapped Sample)",
      pch = 19,           # Shape of the points (circle)
      col = rgb(0, 0, 1, 0.5)  # Color with transparency (blue)
)

```



```{r}

nhanes_subset <- nhanes[, c("age", "body.weight", "height", "avg.sbp", "avg.dbp", "cholestrol")]

# Compute the correlation matrix
cor_matrix <- cor(nhanes_subset, use = "pairwise.complete.obs")

# Print the correlation matrix
print(cor_matrix)

corrplot(cor_matrix, method = "color", type = "upper", tl.col = "black", tl.cex = 0.8)


```

The most highly correlate features are age and average systolic blood pressure, body weight and height, average systolic blood pressure and average diastolic blood pressure, age and cholesterol, body weight and cholesterol, and body weigh and average diastolic blood pressure. The pairwise relationships are displayed with individual scatter plots. 

```{r}
# Set seed for reproducibility
set.seed(123)

# Bootstrap a smaller sample (adjust size as needed)
sample_indices <- sample(nrow(nhanes), size = 500, replace = TRUE)
sampled_data <- nhanes[sample_indices, ]

# Set up a 2x3 plotting grid
par(mfrow = c(2, 3))

# Scatterplot for Age and Avg. Systolic Blood Pressure
plot(sampled_data$age, sampled_data$avg.sbp, main = "Age vs Avg. Systolic BP", 
     xlab = "Age", ylab = "Avg. Systolic BP", col = "blue", pch = 19)

# Scatterplot for Body Weight and Height
plot(sampled_data$body.weight, sampled_data$height, main = "Body Weight vs Height", 
     xlab = "Body Weight", ylab = "Height", col = "red", pch = 19)

# Scatterplot for Avg. Systolic Blood Pressure and Avg. Diastolic Blood Pressure
plot(sampled_data$avg.sbp, sampled_data$avg.dbp, main = "Avg. Systolic vs Avg. Diastolic BP", 
     xlab = "Avg. Systolic BP", ylab = "Avg. Diastolic BP", col = "forestgreen", pch = 19)

# Scatterplot for Age and Cholesterol
plot(sampled_data$age, sampled_data$cholestrol, main = "Age vs Cholesterol", 
     xlab = "Age", ylab = "Cholesterol", col = "purple", pch = 19)

# Scatterplot for Avg. Systolic BP and Cholesterol
plot(sampled_data$avg.sbp, sampled_data$cholestrol, main = "Avg Systolic BP vs Cholesterol", 
     xlab = "Cholesterol", ylab = "Avg.SBP", col = "gold", pch = 19)

# Scatterplot for Body Weight and Avg. Diastolic Blood Pressure
plot(sampled_data$body.weight, sampled_data$avg.dbp, main = "Body Weight vs Avg. Diastolic BP", 
     xlab = "Body Weight", ylab = "Avg. Diastolic BP", col = "maroon", pch = 19)
```

Six of the most correlated pairwise relationship are visualized in scatter plots. All pairwise relationships are linear with no apparent clustering. 


```{r}

# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))

# Create a contingency table
sex_hbp_table <- table(nhanes$sex, nhanes$hbp)

# Generate mosaic plot
mosaicplot(sex_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and High Blood Pressure",
           xlab = "Sex Status", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
race_hbp_table <- table(nhanes$race, nhanes$hbp)

# Generate mosaic plot
mosaicplot(race_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Race and High Blood Pressure",
           xlab = "Race", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
current.smk_hbp_table <- table(nhanes$current.smk, nhanes$hbp)

# Generate mosaic plot
mosaicplot(current.smk_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Current Smoking Status and HBP",
           xlab = "Current Smoking Status", 
           ylab = "HBP Status",
           border = "black")


# Create a contingency table
past.smk_hbp_table <- table(nhanes$past.smk, nhanes$hbp)

# Generate mosaic plot
mosaicplot(past.smk_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Past Smoking Status and HBP",
           xlab = "Past Smoking Status", 
           ylab = "HBP Status",
           border = "black")
```

Relationship between high blood pressure and race, sex, and current smoking status can be visualized in the mosaic plots. There appears to be an association between high current smoking and race with high blood pressure. A less pronounced relationship can been seen between sex and race with high blood pressure. 


```{r}

# Set up a 2x2 plotting grid
par(mfrow = c(1, 2))


# Create a contingency table
current.smk_race_table <- table(nhanes$race, nhanes$current.smk)

# Generate mosaic plot
mosaicplot(current.smk_race_table, 
           col = c("gold", "royalblue"), 
           main = "Race and Current Smoke Status",
           xlab = "Race", 
           ylab = "Current Smoke",
           border = "black")


# Create a contingency table
current.smk_sex_table <- table(nhanes$sex, nhanes$current.smk)

# Generate mosaic plot
mosaicplot(current.smk_sex_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and Current Smoke Status",
           xlab = "Sex", 
           ylab = "Current Smoke",
           border = "black")

# Create a contingency table
past.smk_race_table <- table(nhanes$race, nhanes$past.smk)



```

An association exists between race and current smoking. Black individuals tend to smoke more than white or other individuals. An association exists between sex and current smoke status. Female tend to smoke currently slight more than males in this population. 



# MISSING VALUES

## Imputation for Categorical Features

The distribution of missing values for is examined to determine any patterns. This step is done to ensure that value are missing at random. 

```{r}
# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))

#sum(is.na(nhanes$race))

#mean(is.na(nhanes$race))

#table(nhanes$race, useNA = "ifany")


#ggplot(nhanes, aes(x = factor(race))) +
  #geom_bar(fill = "blue") +
  #labs(title = "Distribution of Race (Before Imputation)", x = "Race", y = "Count") +
  #theme_minimal()


aggr(nhanes, col = c("navyblue", "red"), numbers = TRUE, sortVars = TRUE, 
     labels = names(nhanes), cex.axis = 0.7, gap = 2, 
     ylab = c("Missing Data Overview", "Pattern of Missing Data"))

```

A pattern of missing values exists between body weight, BMI, and obese. This pattern can be explained by the fact that body weight is used to calculate BMI, which is then used to classify observations for obese. However, the pattern of missing value appears to be random for the other features. 

No pattern of missing values exists between race and other features, In turn, kNN imputation is used for missing values of race.  



```{r}

sum(is.na(nhanes$race))
table(nhanes$race, useNA = "ifany")  # Check distribution including NAs


# Perform KNN Imputation (single imputation)
nhanes_imputed <- kNN(nhanes, variable = "race", k = 3)

# Keep only the original columns (without extra columns added by kNN)
nhanes_imputed <- nhanes_imputed[, names(nhanes)]



# COMPARE before and after imputation distribution

par(mfrow = c(1,2))  # Split plotting area into two

# Convert race to a factor for better visualization
nhanes$race <- as.factor(nhanes$race)
nhanes_imputed$race <- as.factor(nhanes_imputed$race)

# Create a new column to indicate whether the data is original or imputed
nhanes$source <- "Before Imputation"
nhanes_imputed$source <- "After Imputation"

# Combine both datasets
nhanes_compare <- bind_rows(nhanes, nhanes_imputed)

# Plot the distribution before and after imputation
ggplot(nhanes_compare, aes(x = race, fill = source)) +
  geom_bar(position = "dodge") +
  labs(title = "Distribution of Race Before and After Imputation",
       x = "Race",
       y = "Count",
       fill = "Dataset") +
  theme_minimal()

imputed_nhanes <- nhanes_imputed


```
The overall distribution of race after imputation changed slightly after imputation.


## Regression-Based Imputation for Numerical Feature

The pattern of missing values for body weight and cholesterol is examined for any potential patterns.

```{r}


#vis_miss(nhanes[, c("body.weight", "cholestrol")])


# Set smaller font size for the plot
par(cex = 0.4)  # Adjust this value for desired font size

# Generate the missing data pattern plot

md.pattern(imputed_nhanes)

# Reset graphics settings (optional)
par(cex = 1)


```
A pattern of missing values exists between body weight, BMI, and obese because


Body weight is correlated with height, average diastolic blood pressure, and sex. In turn, the missing body weight values will be imputed using a regression model using said explanatory features. 

```{r}

# Subset nhanes to contain only complete observations for selected variables
complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("body.weight", "height", "avg.dbp", "sex")]), ]

# Fit the linear regression model
model_body.weight <- lm(body.weight ~ height + avg.dbp + sex, data = complete_nhanes)

# View the model summary
summary(model_body.weight)

# Plot the residuals
plot(model_body.weight$residuals)


```


The residual plot shows no patterns. In turn, the model fit the data well and the assumption of constant variance of residuals is satisfied.

```{r}

# Create a copy of the dataset to work with
#imputed_nhanes <- complete_nhanes

colSums(is.na(imputed_nhanes))
# Identify the rows with missing values for body.weight
missing_body_weight <- is.na(imputed_nhanes$body.weight)

# Use the model to predict the missing values
imputed_nhanes$body.weight[missing_body_weight] <- predict(model_body.weight, 
                                                           newdata = imputed_nhanes[missing_body_weight, c("height", "avg.dbp", "sex", "age")])

# Check the dataset
#head(imputed_nhanes)

# Create a plot to overlay the density curves
ggplot() +
  geom_density(data = complete_nhanes, aes(x = body.weight), fill = "blue", alpha = 0.2, na.rm = TRUE) + 
  geom_density(data = imputed_nhanes, aes(x = body.weight), fill = "red", alpha = 0.2, na.rm = TRUE) +
  labs(title = "Density Plot of Body Weight: Before and After Imputation",
       x = "Body Weight", y = "Density") +
  theme_minimal() +
  scale_fill_manual(name = "Body Weight", values = c("blue" = "blue", "red" = "red"), 
                    labels = c("Before Imputation", "After Imputation"))


```


The distribution before and after imputation is similar. In turn, the imputation did not change the distribution of body weight. 




The features that are most correlated to cholesterol are age, average systolic blood pressure, and body weight. In turn, the missing values of cholesterol will be imputed with a model using these features. 

```{r}

# Subset nhanes to include only rows with complete cases for cholestrol, age, avg.sbp, and BMI
complete_nhanes <- imputed_nhanes[!is.na(imputed_nhanes$cholestrol) & !is.na(imputed_nhanes$age) & 
                           !is.na(imputed_nhanes$avg.sbp) & !is.na(imputed_nhanes$body.weight), ]

complete_nhanes <- imputed_nhanes[complete.cases(imputed_nhanes[, c("cholestrol", "age", "avg.sbp", "body.weight")]), ]

# Fit the linear regression model to predict cholestrol
model_cholestrol <- lm(cholestrol ~ age + avg.sbp + body.weight, data = complete_nhanes)
summary(model_cholestrol)


# View the model summary
summary(model_cholestrol)

# Create a copy of nhanes for imputation
#imputed_nhanes <- nhanes

# Identify rows with missing cholestrol values
missing_cholestrol <- is.na(imputed_nhanes$cholestrol)

# Use the model to predict the missing cholestrol values
imputed_nhanes$cholestrol[missing_cholestrol] <- predict(model_cholestrol, 
                                                          newdata = imputed_nhanes[missing_cholestrol, c("age", "avg.sbp", "body.weight")])


```

All explanatory features are significant. 

```{r}
# Get residuals from the model for complete_nhanes
residuals_cholestrol <- residuals(model_cholestrol)

# Predicted values for complete_nhanes
predicted_cholestrol <- predict(model_cholestrol, newdata = complete_nhanes)

# Create a data frame for residuals and predicted values
residuals_data <- data.frame(
  predicted_cholestrol = predicted_cholestrol,
  residuals_cholestrol = residuals_cholestrol
)

# Create the residual plot
library(ggplot2)
ggplot(residuals_data, aes(x = predicted_cholestrol, y = residuals_cholestrol)) +
  geom_point(color = "blue") +  # Plot the residuals
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +  # Add a horizontal line at 0
  labs(title = "Residual Plot for Cholestrol Prediction",
       x = "Predicted Cholestrol",
       y = "Residuals") +
  theme_minimal()

# Create the overlay density plot
ggplot() +
  geom_density(data = nhanes, aes(x = cholestrol), fill = "blue", alpha = 0.2, na.rm = TRUE) +  # Before imputation
  geom_density(data = imputed_nhanes, aes(x = cholestrol), fill = "red", alpha = 0.2, na.rm = TRUE) +  # After imputation
  labs(title = "Density Plot of Cholestrol: Before and After Imputation",
       x = "Cholestrol", y = "Density") +
  theme_minimal() +
  scale_fill_manual(name = "Cholestrol", values = c("blue" = "blue", "red" = "red"), 
                    labels = c("Before Imputation", "After Imputation"))


```

The distribution of cholesterol did not change significantly after imputation. As well, the residual plot does not display any patterns. In turn, the model fit the data well and the assumption of constant variance of residuals is satisfied. 

## Feature Engineer

In turn a feature which calculates BMI is created to determine observations are classified as obese. According to the CDC (https://www.cdc.gov/nccdphp/dnpao/data-trends-maps/help/npao_dtm/definitions.html), Adult obesity is defined as body mass index (BMI) ≥ 30.0. Another feature named obese, based on BMI, is created, where 0 = no and 1 = yes. 

```{r}

# Create BMI feature
imputed_nhanes$BMI <- (imputed_nhanes$body.weight * 703) / (imputed_nhanes$height * imputed_nhanes$height)
# Round the BMI to the nearest ones place
imputed_nhanes$BMI <- round(imputed_nhanes$BMI, 0)

# Create the obese feature based on BMI
imputed_nhanes$obese <- ifelse(imputed_nhanes$BMI >= 30, 1, 0)

```

```{r}

par(mfrow = c(1, 2))

# Histogram for BMI

hist(imputed_nhanes$BMI, 
     main = "Histogram of BMI", 
     xlab = "BMI", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)

# bar chart for obese
nhanes$obese <- factor(imputed_nhanes$obese, levels = c(0, 1), labels = c("No", "Yes"))

obese_count <- table(imputed_nhanes$obese)

barplot(height = obese_count, col = "steelblue", main = "Obesity", xlab = "Obesity", ylab = "Count")


```


```{r}
# Set up a 2x2 plotting grid
par(mfrow = c(2, 2))


# Create a contingency table
obese_hbp_table <- table(imputed_nhanes$obese, imputed_nhanes$hbp)

# Generate mosaic plot
mosaicplot(obese_hbp_table, 
           col = c("gold", "royalblue"), 
           main = "Obesity and High Blood Pressure",
           xlab = "Obesity Status", 
           ylab = "HBP Status",
           border = "black")

# Create a contingency table
obese_sex_table <- table(imputed_nhanes$sex, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_sex_table, 
           col = c("gold", "royalblue"), 
           main = "Sex and Obesity",
           xlab = "Obesity Status", 
           ylab = "Sex",
           border = "black")


# Create a contingency table
obese_race_table <- table(imputed_nhanes$race, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_race_table, 
           col = c("gold", "royalblue"), 
           main = "Race and Obesity",
           xlab = "Obesity Status", 
           ylab = "Race",
           border = "black")


# Create a contingency table
obese_smoke_table <- table(imputed_nhanes$current.smk, imputed_nhanes$obese)

# Generate mosaic plot
mosaicplot(obese_smoke_table, 
           col = c("gold", "royalblue"), 
           main = "Current Smoking Status and Obesity",
           xlab = "Obesity Status", 
           ylab = "Current Smoking Status",
           border = "black")


```





## Multiple Imputation

Multiple Imputation by Chained Equations (MICE) is a powerful technique used to handle missing data by creating multiple plausible imputations for each missing value based on relationships in the observed data. For body weight, cholesterol, and race, MICE can be applied to generate accurate imputations, accounting for correlations between these variables and other features in the dataset. This method allows for more reliable statistical analyses by reducing bias and preserving variability in the imputed values, making it especially useful for handling missing data in incoming new data sets.

```{r}

nhanes1$race <- as.factor(nhanes1$race)  
nhanes1$cholestrol <- as.numeric(nhanes1$cholestrol)
nhanes1$body.weight <- as.numeric(nhanes1$body.weight)

nhanes2 <- nhanes1


# 1 initialization

init <- mice(nhanes2, maxit=0)
print(init$method)


# 2 imputation models

imp <- mice(nhanes2, method = c("", "", "", "", "", "", "", "polyreg", "pmm", "", "", "", "", "", "pmm", ""), 
            maxit=10,
            m=1,
            seed = 123,
            print=F)

complete(imp, action = 1)[1:10,]


# 3 multiple (iterative) imputations

imp5 <- mice(nhanes2, method = "pmm", m = 5, maxit = 10, seed = 123, print=F)
plot(imp5)



```


Multiple Imputation by Chained Equations (MICE) will be applied to handle missing values in the body weight variable. To assess the quality of the imputations, the Mean Squared Error (MSE) will be calculated and compared the MSE of the single imputation conducted for body weight earlier in this analysis, providing a measure of the imputation accuracy.


```{r}
model5_bodyweight <- with(imp5, lm(body.weight ~ height + avg.dbp + sex))
summary.stats = summary(model5_bodyweight)

summary(pool(model5_bodyweight))

beta = summary.stats$estimate[seq(1,15,by=3)]  # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2 
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)


```
The model can be used to predict body weight for incoming new data. 

The MSE for single body weight imputation used earlier in this analysis is MSE = (RSE)^2 = (33.52)^2 = 1123.6 which is substantially higher than MSE when using MICE. In turn, MICE is recommended in this analysis. 



Multiple Imputation by Chained Equations (MICE) will be applied to handle missing values in the cholesterol variable. To assess the quality of the imputations, the Mean Squared Error (MSE) will be calculated and compared the MSE of the single imputation conducted for cholesterol earlier in this analysis, providing a measure of the imputation accuracy.

```{r}
model5_chol <- with(imp5, lm(cholestrol ~ age + avg.sbp))
summary.stats = summary(model5_chol)


summary(pool(model5_chol))

beta = summary.stats$estimate[seq(1,15,by=3)]  # explicit vector: c(1,4,7,10,13)
beta.var = (summary.stats$std.error[seq(1,15,by=3)])^2 
Q = mean(beta)
U = mean(beta.var)
B = var(beta)
T = U + (6/5)*B
pool.se = sqrt(T)
cbind(pool.se.intercept = pool.se)


```
The model can be used to predict cholesterol for incoming new data. 

The Mean Squared Error (MSE) from the Residual Standard Error (RSE) given in the model output from the cholesterol single imputation method used early in this analysis is MSE = (42.59)^2 = 1813.9. Thus, the Mean Squared Error (MSE) = 1813.9.

The MSE using MICE is 3.66 which is substantially less. Therefore, MICE is recommended in the analysis of this data set. 



For the missing values in the race variable, Multiple Imputation by Chained Equations (MICE) will be applied using multinomial logistic regression. This method will account for the categorical nature of the race variable, allowing for appropriate imputation based on the relationships with other features in the dataset.

```{r}

model_race <- with(imp5, multinom(race ~ avg.dbp + avg.sbp + cholestrol + hbp))

# Pool the results from the imputed datasets
pooled_race <- pool(model_race)

# Get a summary of the pooled results
summary(pooled_race)

```

The model can be used to predict race for incoming new data. 


# FEATURE TRANSFORMING

Based on the distribution patterns observed in the NHANES data set, several feature engineering tasks can be performed to enhance the data’s usability for modeling. First, for age, which is roughly uniform, no major transformation is needed, though we may consider binning it into age groups if it helps with interpretability. For weight and systolic blood pressure (BP), which are both skewed right, a log transformation could be applied to reduce skewness and make these variables more normally distributed, improving model performance. Similarly, for cholesterol, which is slightly skewed right, a log transformation may also be useful. For diastolic BP, which is roughly symmetric, no transformation is necessary, but checking for outliers could be helpful. These transformations aim to stabilize variance, reduce the impact of extreme values, and create features that are more suitable for modeling with techniques like linear regression or other statistical methods.


```{r}
#Log transformation to reduce skew

imputed_nhanes$log_body.weight <- log(imputed_nhanes$body.weight)
imputed_nhanes$log_avg.sbp <- log(imputed_nhanes$avg.sbp)
imputed_nhanes$log_cholestrol <- log(imputed_nhanes$cholestrol)
imputed_nhanes$log_BMI <- log(imputed_nhanes$BMI)




# Set up the plotting area to show 2x2 grid of histograms
par(mfrow = c(2, 2))  # This will create a 2x2 grid of plots

# Histogram for Log-Cholesterol
hist(imputed_nhanes$log_cholestrol, 
     main = "Histogram of Log(Cholesterol)", 
     xlab = "Log(Cholesterol)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)  
    
# Histogram for Log-Body Weight
hist(imputed_nhanes$log_body.weight, 
     main = "Histogram of Log(Body Weight)", 
     xlab = "Log(Body Weight)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20) 


# Histogram for Log-Systolic BP
hist(imputed_nhanes$log_avg.sbp, 
     main = "Histogram of Log(Systolic BP)", 
     xlab = "Log(Systolic BP)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)
   

# Histogram for Log-BMI
hist(imputed_nhanes$log_BMI, 
     main = "Histogram of Log(BMI)", 
     xlab = "Log(BMI)", 
     col = "steelblue", 
     border = "black", 
     breaks = 20)


```

The distributions of these previously skewed features are more symmetric after log transformations. 


# STANDARDIZING

Standardizing features ensures that variables with different scales, such as age, BMI, height, and cholesterol, contribute equally to the analysis. Without standardization, variables with larger ranges could dominate model performance, leading to biased results. By standardizing these features, we improve model accuracy, enable better comparison between variables, and enhance the efficiency of machine learning algorithms, particularly in models that rely on distance-based or gradient-based methods.


```{r}

# Specify the columns to be standardized
columns_to_standardize <- c("age", "BMI", "height", "body.weight", "cholestrol", "avg.dbp", "avg.sbp")

# Standardize the selected columns
imputed_nhanes[columns_to_standardize] <- scale(imputed_nhanes[columns_to_standardize])

  
```





# FEATURE SELECTION

Feature selection is a critical step in model building, as it helps improve model performance by identifying the most relevant predictors and reducing overfitting. In this analysis, both model-based selection and regularization-based selection techniques will be used to assess and retain the most influential features for accurate prediction, ensuring a more efficient and interpretable model.


A full model is created to predict cholesterol.

```{r}

#sapply(imputed_nhanes, function(x) length(unique(x)))

#head(imputed_nhanes, 25)


#head(imputed_nhanes)


model_full <- lm(cholestrol ~ age + race + BMI + height + body.weight + 
                  hbp + avg.dbp + avg.sbp + obese, 
                  data = imputed_nhanes)

# Summary of the full model
#summary(model_full)

# Tidy the model summary
model_summary <- tidy(model_full)

# Calculate Mean Squared Error (MSE)
mse <- mean(model_full$residuals^2)

# Calculate R-squared
r2 <- summary(model_full)$r.squared

# Create a data frame for MSE and R-squared
metrics_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                         Value = c(mse, r2))

# Print the model summary using kable
kable(model_summary, digits = 3, caption = "Regression Model Summary")

# Print MSE and R-squared using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics")

```

Stepwise model selection is conducted to retain significant predictors. Selection processes based on AIC and BIC are conducted. 

```{r}
# Stepwise model selection based on AIC
model_stepwise_AIC <- step(model_full, direction = "both", trace = 0)

# Stepwise model selection based on BIC
model_stepwise_BIC <- step(model_full, direction = "both", k = log(nrow(imputed_nhanes)), trace = 0)


# Tidy the summaries for both models
stepwise_AIC_summary <- tidy(model_stepwise_AIC)
stepwise_BIC_summary <- tidy(model_stepwise_BIC)

# Compute MSE for both models
mse_AIC <- mean(model_stepwise_AIC$residuals^2)
mse_BIC <- mean(model_stepwise_BIC$residuals^2)

# Compute R-squared for both models
r2_AIC <- summary(model_stepwise_AIC)$r.squared
r2_BIC <- summary(model_stepwise_BIC)$r.squared

# Create data frames for MSE and R-squared
metrics_AIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                             Value = c(mse_AIC, r2_AIC))

metrics_BIC_df <- data.frame(Statistic = c("Mean Squared Error", "R-squared"), 
                             Value = c(mse_BIC, r2_BIC))

# Print the stepwise AIC model summary
kable(stepwise_AIC_summary, digits = 3, caption = "Stepwise AIC Regression Model Summary")

# Print performance metrics for AIC model
kable(metrics_AIC_df, digits = 3, caption = "Model Performance Metrics (AIC)")

# Print the stepwise BIC model summary
kable(stepwise_BIC_summary, digits = 3, caption = "Stepwise BIC Regression Model Summary")

# Print performance metrics for BIC model
kable(metrics_BIC_df, digits = 3, caption = "Model Performance Metrics (BIC)")
```



```{r}
# Check R-squared value
summary(model_stepwise_AIC)$r.squared
summary(model_stepwise_BIC)$r.squared


```

Both AIC and BIC stepwise model selection processes have similar R squared values. The BIC stepwise model is parsimonious. This model selects age, height, body weight, and average diastolic blood pressure as predictors for cholesterol. 



Regularization-based selection techniques are useful for improving model performance and preventing overfitting, especially when you have many predictors. Two common regularization methods are Ridge regression and Lasso regression.

```{r}

# Extract the response and predictors
y <- imputed_nhanes$cholestrol  # Response variable
X <- as.matrix(imputed_nhanes[, c("age", "BMI", "height", "body.weight", "avg.dbp", "avg.sbp")])

# Optional: Scale predictors (recommended for Ridge and Lasso)
X_scaled <- scale(X)

# Fit Ridge and Lasso models
ridge_model <- glmnet(X_scaled, y, alpha = 0)  # Ridge Regression (alpha = 0)
lasso_model <- glmnet(X_scaled, y, alpha = 1)  # Lasso Regression (alpha = 1)

# Ridge cross-validation
cv_ridge <- cv.glmnet(X_scaled, y, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min

# Lasso cross-validation
cv_lasso <- cv.glmnet(X_scaled, y, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min

# Extract coefficients for Ridge and Lasso at optimal lambda
ridge_coefficients <- as.data.frame(as.matrix(coef(cv_ridge, s = "lambda.min")))
colnames(ridge_coefficients) <- c("Coefficient")
ridge_coefficients$Predictor <- rownames(ridge_coefficients)

lasso_coefficients <- as.data.frame(as.matrix(coef(cv_lasso, s = "lambda.min")))
colnames(lasso_coefficients) <- c("Coefficient")
lasso_coefficients$Predictor <- rownames(lasso_coefficients)

# Predictions for Ridge
pred_ridge <- predict(cv_ridge, X_scaled, s = "lambda.min")

# Predictions for Lasso
pred_lasso <- predict(cv_lasso, X_scaled, s = "lambda.min")

# Calculate RMSE for Ridge
rmse_ridge <- sqrt(mean((pred_ridge - y)^2))

# Calculate RMSE for Lasso
rmse_lasso <- sqrt(mean((pred_lasso - y)^2))

# Calculate R² for Ridge
rss_ridge <- sum((pred_ridge - y)^2)  # Residual Sum of Squares
tss_ridge <- sum((y - mean(y))^2)    # Total Sum of Squares
r2_ridge <- 1 - rss_ridge / tss_ridge

# Calculate R² for Lasso
rss_lasso <- sum((pred_lasso - y)^2)
tss_lasso <- sum((y - mean(y))^2)
r2_lasso <- 1 - rss_lasso / tss_lasso

# Create a data frame for RMSE and R²
metrics_df <- data.frame(
  Model = c("Ridge", "Lasso"),
  RMSE = c(rmse_ridge, rmse_lasso),
  R2 = c(r2_ridge, r2_lasso)
)

# Print Ridge coefficients using kable
kable(ridge_coefficients, digits = 3, caption = "Ridge Regression Coefficients")

# Print Lasso coefficients using kable
kable(lasso_coefficients, digits = 3, caption = "Lasso Regression Coefficients")

# Print RMSE and R² values using kable
kable(metrics_df, digits = 3, caption = "Model Performance Metrics (RMSE & R²)")
```


The Ridge and Lasso method both had similar R squared values. Therefore, both models are recommended. 

# LINEAR REGRESSION

In this regression analysis of the NHANES dataset, three models are developed using cross validation to predict the outcome variable cholesterol: a full model including all selected predictors, a stepwise BIC model for variable selection, and a Ridge regression model to address multicollinearity. Each model was evaluated based on its predictive performance, with Mean Squared Error (MSE) and R-squared (R^2) used as key comparison metrics. By analyzing these models, the aim is to determine the most accurate and efficient approach for modeling the data while balancing complexity and predictive power.

```{r}

set.seed(42)
train_control <- trainControl(method = "cv", number = 10)

```


## Full Model

The full model includes all selected predictor variables, providing a comprehensive approach to estimating cholesterol levels. To assess its stability, we implement k-fold cross-validation, which systematically partitions the dataset into multiple training and validation subsets. This approach mitigates the risk of overfitting by testing the model’s performance on unseen data, yielding a robust estimate of prediction error. The model’s effectiveness is quantified using the Mean Squared Error (MSE), which measures the average squared difference between observed and predicted cholesterol values. By employing cross-validation, we ensure that the full model is evaluated rigorously, providing insights into its accuracy and reliability in real-world applications.


```{r}

# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)  # 10-fold CV

# Fit the full model
full_model <- train(cholestrol ~ age + race + log_BMI + height + log_body.weight + 
                    hbp + avg.dbp + log_avg.sbp + obese, 
                    data = imputed_nhanes, 
                    method = "lm", 
                    trControl = train_control)

# Extract coefficients from the final model
coeff_table <- summary(full_model$finalModel)$coefficients %>%
  as.data.frame() %>%
  rownames_to_column(var = "Predictor")  # Convert row names to a column

# Calculate Cross-Validation MSE
mse <- mean(full_model$resample$RMSE^2)

# Calculate R² from the final model
r2 <- summary(full_model$finalModel)$r.squared

# Create a data frame for MSE and R²
metrics_df <- data.frame(
  Statistic = c("Mean Squared Error", "R-squared"),
  Value = c(mse, r2)
)

# Print the coefficients using kable
kable(coeff_table, digits = 4, caption = "Regression Model Coefficients", format = "markdown")

# Print MSE and R² using kable
kable(metrics_df, digits = 4, caption = "Model Performance Metrics (MSE & R²)", format = "markdown")
```



## Stepwise BIC

In this analysis, a Stepwise BIC regression model is implemented to identify the most parsimonious subset of predictors for estimating cholesterol levels. The Stepwise BIC approach selects variables by minimizing the Bayesian Information Criterion (BIC), which penalizes model complexity to prevent overfitting. To assess the model’s generalizability, we employ cross-validation by splitting the dataset into training and testing subsets. The model is trained on the training data, and predictions are made on the test data to compute the Mean Squared Error (MSE) as a measure of model performance. Additionally, the estimated regression coefficients in a structured table for clarity and interpretation. This approach ensures an optimal balance between model complexity and predictive accuracy while validating the model’s robustness.

```{r}

stepwise_bic_model <- function(train_data, test_data) {
  # Fit initial full model
  full_model2 <- lm(cholestrol ~ age + height + log_body.weight + avg.dbp, data = train_data)
  
  # Perform stepwise selection using BIC
  model_stepwise_BIC <- stepAIC(full_model2, direction = "both", k = log(nrow(train_data)), trace = FALSE)
  
  # Extract coefficients
  coeff_table <- summary(model_stepwise_BIC)$coefficients %>%
    as.data.frame() %>%
    rownames_to_column(var = "Predictor")  # Convert row names to a column

  # Print coefficients using kable
  print(kable(coeff_table, digits = 4, caption = "Stepwise BIC Model Coefficients", format = "markdown"))

  # Predict on test data
  predictions <- predict(model_stepwise_BIC, newdata = test_data)
  
  # Compute Mean Squared Error (MSE)
  mse <- mean((test_data$cholestrol - predictions)^2)

  # Compute R²
  rss <- sum((test_data$cholestrol - predictions)^2)  # Residual Sum of Squares
  tss <- sum((test_data$cholestrol - mean(test_data$cholestrol))^2)  # Total Sum of Squares
  r2 <- 1 - rss/tss  # R-squared
  
  # Create a data frame for MSE and R²
  metrics_df <- data.frame(
    Statistic = c("Mean Squared Error", "R-squared"),
    Value = c(mse, r2)
  )

  # Print MSE and R² using kable
  print(kable(metrics_df, digits = 4, caption = "Stepwise BIC Model Performance Metrics (MSE & R²)", format = "markdown"))

  return(list(model = model_stepwise_BIC, mse = mse, r2 = r2))  # Return the model, MSE, and R²
}

# Example usage (Assuming you have a dataset split into train_data and test_data)
set.seed(123)
train_index <- createDataPartition(imputed_nhanes$cholestrol, p = 0.8, list = FALSE)
train_data <- imputed_nhanes[train_index, ]
test_data <- imputed_nhanes[-train_index, ]

# Run the function
stepwise_results <- stepwise_bic_model(train_data, test_data)


```


## Ridge Regression

Ridge regression is applied to the NHANES dataset to address multicollinearity among predictor variables while maintaining model stability. By imposing a penalty on large coefficients through L2 regularization, Ridge regression prevents overfitting and improves generalization. The model was evaluated using cross-validation, and its performance was compared to other regression approaches based on Mean Squared Error (MSE) and R².

```{r}

# Define predictor matrix (X) and response variable (y)
X <- as.matrix(imputed_nhanes[, c("age", "log_BMI", "height", "log_body.weight", "avg.dbp", "log_avg.sbp")])
y <- imputed_nhanes$cholestrol

set.seed(123)  # For reproducibility

# Define training control with 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Train Ridge Regression Model
ridge_model <- train(
  x = X, y = y,
  method = "glmnet",
  trControl = train_control,
  tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(-3, 3, length = 100))  # Ridge: alpha = 0
)

# Best lambda value selected via cross-validation
best_lambda <- ridge_model$bestTune$lambda

# Predict using best lambda
ridge_predictions <- predict(ridge_model, X)

# Compute Mean Squared Error (MSE)
ridge_mse <- mean((y - ridge_predictions)^2)

# Compute R²
rss <- sum((y - ridge_predictions)^2)  # Residual Sum of Squares
tss <- sum((y - mean(y))^2)  # Total Sum of Squares
ridge_r2 <- 1 - (rss / tss)  # R² calculation

# Extract coefficients from the final Ridge model
coefficients <- as.data.frame(as.matrix(ridge_model$finalModel$beta)) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

#print(kable(head(coefficients, 10), digits = 4, caption = "Top 10 Ridge Regression Coefficients", format = "markdown"))


# Create a data frame for MSE, R², and Best Lambda
metrics_df <- data.frame(
  Statistic = c("Mean Squared Error", "R-squared", "Optimal Lambda"),
  Value = c(ridge_mse, ridge_r2, best_lambda)
)

# Print model performance metrics using kable()
print(kable(metrics_df, digits = 4, caption = "Ridge Regression Performance Metrics", format = "markdown"))
```



## Comparison

After performing cross-validation on the NHANES_imputed dataset, the predictive performance of the thre models are compared. The BIC model has the smallest MSE and highest R^2 value, indicating a better performing regression model. In, turn, the BIC model is recommended. 




# LOGISTIC REGRESSION

This analysis applies logistic regression to predict current smoking status using data from NHANES. Four models are developed: a full model including all relevant predictors, a stepwise AIC model for variable selection, a BIC model for a more parsimonious approach, and a regularized logistic regression model using Lasso to prevent overfitting. The models are evaluated and compared using Receiver Operating Characteristic (ROC) curves and the Area Under the Curve (AUC) to assess their classification performance and predictive accuracy.

## Full Model

The first candidate, a full model, is fitted to predict current smoking status. 

```{r}

# Fit logistic regression model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Extract coefficients from the model
coefficients <- summary(full_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table <- as.data.frame(coefficients) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table, digits = 4, caption = "Logistic Regression Model Coefficients", format = "markdown")



```

Age, race, and average dbp are significant features in the full model. 

## Stepwise AIC Model

The second model, stepwise AIC, is fitted to predict current smoking status. This model automatically selects a subset of predictors using AIC-based stepwise selection. 

```{r}


# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Perform stepwise selection using AIC
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# Extract coefficients from the stepwise model
coefficients_stepwise <- summary(stepwise_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table_stepwise <- as.data.frame(coefficients_stepwise) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table_stepwise, digits = 4, caption = "Stepwise Logistic Regression Model Coefficients", format = "markdown")



```

All features, other than obese, are significant in the stepwise AIC model. 


## BIC Model

The third candidate,a BIC model, is fitted to predict current smoking status. This model fits a more parsimonious model. 

```{r}

# Fit initial full model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + 
                    height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes, 
                  family = binomial)

# Perform stepwise selection using BIC (using k = log(nrow(imputed_nhanes)))
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)

# Extract coefficients from the BIC model
coefficients_bic <- summary(bic_model)$coefficients

# Convert coefficients to a data frame and add row names as a column
coeff_table_bic <- as.data.frame(coefficients_bic) %>%
  tibble::rownames_to_column(var = "Predictor")  # Convert row names to a column

# Print the coefficients table using kable
kable(coeff_table_bic, digits = 4, caption = "BIC Stepwise Model Coefficients", format = "markdown")


```

All features are significant in the BIC model.


## Regularized Logistic Regression Model (Lasso)

The fourth candidate model uses Regularized Logistic Regression (Lasso). This method selects the most important predictors.

```{r}

# Define predictor matrix (X) and response variable (y)
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, 
                  data = imputed_nhanes)[, -1]  # Exclude intercept
y <- imputed_nhanes$current.smk

# Train the Lasso model using cross-validation
lasso_model <- cv.glmnet(x, y, family = "binomial", alpha = 1)


# Best lambda value selected via cross-validation
best_lambda_lasso <- lasso_model$lambda.min
print(paste("Optimal Lambda for Lasso:", best_lambda_lasso))

# Extract coefficients from the final Lasso model at the optimal lambda
lasso_coefficients <- coef(lasso_model, s = "lambda.min")

# Convert coefficients to a data frame
lasso_coefficients_df <- as.data.frame(as.matrix(lasso_coefficients))
lasso_coefficients_df$Predictor <- rownames(lasso_coefficients_df)

# Reorder columns to make predictor names the first column
lasso_coefficients_df <- lasso_coefficients_df %>%
  select(Predictor, everything())

# Print the coefficients table using kable
kable(lasso_coefficients_df, digits = 4, caption = "Lasso Regression Coefficients", format = "markdown")


```


## Comparison

In this analysis, four logistic regression models—Full Model, Stepwise AIC Model, BIC Model, and Lasso Model are compared using ROC (Receiver Operating Characteristic) curves and AUC (Area Under the Curve) to assess their predictive performance.  The optimal cutoff for each model using Youden’s Index is determined, which maximizes the balance between sensitivity and specificity. By evaluating these models through both graphical and numerical metrics, insights is gained into which model provides the best classification performance for predicting the outcome variable.

```{r}
# Full Model
full_model <- glm(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp, data = imputed_nhanes, family = binomial)

# Stepwise AIC Model
stepwise_model <- stepAIC(full_model, direction = "both", trace = FALSE)

# BIC Model
bic_model <- stepAIC(full_model, direction = "both", k = log(nrow(imputed_nhanes)), trace = FALSE)

# Lasso Model
x <- model.matrix(current.smk ~ age + race + log_body.weight + log_BMI + obese + height + log_avg.sbp + avg.dbp + log_cholestrol + hbp , data = imputed_nhanes)[, -1]  # Exclude intercept
y <- imputed_nhanes$current.smk
lasso_cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)
lasso_model <- glmnet(x, y, family = "binomial", alpha = 1, lambda = lasso_cv$lambda.min)

# Get predicted probabilities
imputed_nhanes$full_pred <- predict(full_model, type = "response")
imputed_nhanes$stepwise_pred <- predict(stepwise_model, type = "response")
imputed_nhanes$bic_pred <- predict(bic_model, type = "response")
imputed_nhanes$lasso_pred <- predict(lasso_model, newx = x, type = "response")[,1]

# Create ROC curves
full_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$full_pred)
stepwise_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$stepwise_pred)
bic_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$bic_pred)
lasso_roc <- roc(imputed_nhanes$current.smk, imputed_nhanes$lasso_pred)

# Calculate optimal cutoff using Youden's Index
optimal_cutoff_full <- full_roc$specificities[which.max(full_roc$sensitivities + full_roc$specificities - 1)]
optimal_cutoff_stepwise <- stepwise_roc$specificities[which.max(stepwise_roc$sensitivities + stepwise_roc$specificities - 1)]
optimal_cutoff_bic <- bic_roc$specificities[which.max(bic_roc$sensitivities + bic_roc$specificities - 1)]
optimal_cutoff_lasso <- lasso_roc$specificities[which.max(lasso_roc$sensitivities + lasso_roc$specificities - 1)]

# Print optimal cutoffs
cat("Optimal cutoff for Full Model:", optimal_cutoff_full, "\n")
cat("Optimal cutoff for Stepwise AIC Model:", optimal_cutoff_stepwise, "\n")
cat("Optimal cutoff for BIC Model:", optimal_cutoff_bic, "\n")
cat("Optimal cutoff for Lasso Model:", optimal_cutoff_lasso, "\n")

# Plot all ROC curves
ggplot() +
  geom_line(aes(x = full_roc$specificities, y = full_roc$sensitivities, color = "Full Model")) +
  geom_line(aes(x = stepwise_roc$specificities, y = stepwise_roc$sensitivities, color = "Stepwise AIC")) +
  geom_line(aes(x = bic_roc$specificities, y = bic_roc$sensitivities, color = "BIC Model")) +
  geom_line(aes(x = lasso_roc$specificities, y = lasso_roc$sensitivities, color = "Lasso Model")) +
  labs(title = "ROC Curve Comparison", x = "1 - Specificity", y = "Sensitivity") +
  scale_color_manual(values = c("red", "blue", "green", "purple", "orange")) +
  theme_minimal()

# AUC values
auc_values <- data.frame(
  Model = c("Full Model", "Stepwise AIC", "BIC Model", "Lasso Model"),
  AUC = c(auc(full_roc), auc(stepwise_roc), auc(bic_roc), auc(lasso_roc))
)

print(auc_values)
```

All models have very similar AUC. In turn, the more parsimonious model, BIC, is recommended. 

# CONCLUSION

In this analysis, a comprehensive approach was taken to explore and model the NHANES dataset. Beginning with Exploratory Data Analysis (EDA), patterns and missing values were identified, leading to the implementation of Multiple Imputation by Chained Equations (MICE) to handle missing data. Feature engineering and transformation were applied to enhance model performance, followed by feature selection to refine predictor variables. Both linear and logistic regression models were developed and evaluated, including full models, stepwise selection methods, and regularized regression techniques. Ultimately, the Bayesian Information Criterion (BIC) emerged as the best-performing model for both linear and logistic regression, demonstrating its effectiveness in balancing model complexity and predictive accuracy.
